You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by zheh12 <gi...@git.apache.org> on 2018/05/07 10:12:25 UTC

[GitHub] spark pull request #21257: [SPARK-24194] HadoopFsRelation cannot overwrite a...

GitHub user zheh12 opened a pull request:

    https://github.com/apache/spark/pull/21257

    [SPARK-24194] HadoopFsRelation cannot overwrite a path that is also b…

    ## What changes were proposed in this pull request?
    
    When insert overwrite in a parquet table. There will be a error check 
    
    ```
          if (overwrite) DDLUtils.verifyNotReadPath(actualQuery, outputPath)
    ```
    But we can do this for a hive table.
    
    The reason why we can't overwrite a **HadoopFsRelation** with output same as input is we delete the output path first. I think we can fix this with latter delete, just mark path should be deleted after the job 
    commit.  
    
    ## How was this patch tested?
    
    I change the test code **InsertSuite** and **MetastoreDataSourceSuite**. They now are input and output table can be same test. 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zheh12/spark SPARK-24194

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21257.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21257
    
----

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Jenkins, retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186707079
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    add an `overwrite: Boolean` parameter to `FileCommitProtocol.deleteWithJob`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186599760
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    First of all, if it is the root directory of the table, I must record all the files in the directory, and wait until the job is commited to delete. Because the `_temporary` of the entire job is also in the directory, I cannot directly delete the entire directory.
    
    Second, when we record the files that need to be deleted, we just list the files in the root directory non-recursively. Under normal circumstances, the number of files in the first-level directory of the partition table will not be too much.
    
    In the end, this will certainly be slower than directly deleting the entire directory, but under the current implementation, we cannot directly delete the entire table directory.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186936032
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    That's a good idea. I change my code.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90632/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90400/testReport)** for PR 21257 at commit [`a51620b`](https://github.com/apache/spark/commit/a51620b7664ab75442ffded955a826006885cdee).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r197176292
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +210,23 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +    if (fs.exists(staticPrefixPath)) {
    +      if (staticPartitionPrefix.isEmpty && outputCheck) {
    +        // input contain output, only delete output sub files when job commit
    +          val files = fs.listFiles(staticPrefixPath, false)
    --- End diff --
    
    if there are a lot of files here, you've gone from a dir delete which was O(1) on a fileystem, probably O(descendant) on an object store to at O(children) on an FS, O(children * descendants (chlld)) op here.  Not significant for a small number of files, but could potentially be expensive. Why do the iteration at all?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r197177156
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -235,4 +244,41 @@ class HadoopMapReduceCommitProtocol(
           tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
         }
       }
    +
    +  /**
    +   * now just record the file to be delete
    +   */
    +  override def deleteWithJob(fs: FileSystem, path: Path,
    --- End diff --
    
    No need to worry about concurrent access here, correct? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r187960677
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala ---
    @@ -120,7 +120,8 @@ abstract class FileCommitProtocol {
        * Specifies that a file should be deleted with the commit of this job. The default
        * implementation deletes the file immediately.
        */
    -  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
    +  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean,
    --- End diff --
    
    In the current situation we can delete it, but I feel it better to use a default value true.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90632/testReport)** for PR 21257 at commit [`6821795`](https://github.com/apache/spark/commit/68217952e25c2eef0064c433fe78e1a2240cb659).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r188262174
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -163,6 +170,15 @@ class HadoopMapReduceCommitProtocol(
       }
     
       override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
    +    // first delete the should delete special file
    +    for (fs <- pathsToDelete.keys) {
    +      for (path <- pathsToDelete(fs)) {
    +        if (fs.exists(path)) {
    +          fs.delete(path, true)
    --- End diff --
    
    1. you don't need to do the exists check, it's just overhead. delete() will return false if there was nothing to delete.
    2. But...what if that delete throws an exception? Should the commit fail (as it does now?), or downgraded. As an example, the hadoop `FileOutputCommtter` uses the option `"mapreduce.fileoutputcommitter.cleanup-failures.ignored` to choose what to do there
    3. ...and: what about cleanup in an abort job?
    
    I think you'd be best off isolating this cleanup into its own method and call from both job commit & job abort, in job commit discuss with others what to do, and in job abort just log & continue


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92165/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    ping @zheh12 to address comments. I am going to suggest to close this for now while I am identifying PRs to close now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r187930560
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -163,6 +169,12 @@ class HadoopMapReduceCommitProtocol(
       }
     
       override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
    +    // first delete the should delete special file
    +    val committerFs = jobContext.getWorkingDirectory.getFileSystem(jobContext.getConfiguration)
    --- End diff --
    
    StagingDir is not always be valid hadoop path, but the JobContext work dir always be.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186600573
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    have you considered the approach taken by `dynamicPartitionOverwrite`? i.e. using staging directory.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90957/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90619 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90619/testReport)** for PR 21257 at commit [`6821795`](https://github.com/apache/spark/commit/68217952e25c2eef0064c433fe78e1a2240cb659).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r197172859
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -235,4 +244,41 @@ class HadoopMapReduceCommitProtocol(
           tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
         }
       }
    +
    +  /**
    +   * now just record the file to be delete
    +   */
    +  override def deleteWithJob(fs: FileSystem, path: Path,
    +      canDeleteNow: Boolean = true): Boolean = {
    +    if (canDeleteNow) {
    +      super.deleteWithJob(fs, path)
    +    } else {
    +      val set = if (pathsToDelete.contains(fs)) {
    +        pathsToDelete(fs)
    +      } else {
    +        new mutable.HashSet[Path]()
    +      }
    +
    +      set.add(path)
    +      pathsToDelete.put(fs, set)
    +      true
    +    }
    +  }
    +
    +  private def cleanPathToDelete(): Unit = {
    +    // first delete the should delete special file
    +    for (fs <- pathsToDelete.keys) {
    +      for (path <- pathsToDelete(fs)) {
    +        try {
    +          if (!fs.delete(path, true)) {
    +            logWarning(s"Delete path ${path} fail at job commit time")
    --- End diff --
    
    delete -> false just means there was nothing there, I wouldn't warn at that point. Unless `delete()` throws an exception you assume that when the call returns, `fs.exists(path)` does not hold -regardless of the return value. (Special exception, the dest is "/")


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90413 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90413/testReport)** for PR 21257 at commit [`a51620b`](https://github.com/apache/spark/commit/a51620b7664ab75442ffded955a826006885cdee).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90636 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90636/testReport)** for PR 21257 at commit [`803b0a0`](https://github.com/apache/spark/commit/803b0a087ba223f4a1d001a705e56fa94ae9f212).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r187929156
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -898,4 +898,12 @@ object DDLUtils {
             "Cannot overwrite a path that is also being read from.")
         }
       }
    +
    +  def verifyReadPath(query: LogicalPlan, outputPath: Path): Boolean = {
    --- End diff --
    
    isInReadPath or inReadPath or isReadPath better?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #92165 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92165/testReport)** for PR 21257 at commit [`89599e6`](https://github.com/apache/spark/commit/89599e6bf6101e8731ba7c20641bd1cc254278b1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r188211717
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala ---
    @@ -120,7 +120,8 @@ abstract class FileCommitProtocol {
        * Specifies that a file should be deleted with the commit of this job. The default
        * implementation deletes the file immediately.
        */
    -  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
    +  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean,
    --- End diff --
    
    I will remove the `recursive` parameter.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] HadoopFsRelation cannot overwrite a path t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r197173180
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -235,4 +244,41 @@ class HadoopMapReduceCommitProtocol(
           tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
         }
       }
    +
    +  /**
    +   * now just record the file to be delete
    +   */
    +  override def deleteWithJob(fs: FileSystem, path: Path,
    +      canDeleteNow: Boolean = true): Boolean = {
    +    if (canDeleteNow) {
    +      super.deleteWithJob(fs, path)
    +    } else {
    +      val set = if (pathsToDelete.contains(fs)) {
    +        pathsToDelete(fs)
    +      } else {
    +        new mutable.HashSet[Path]()
    +      }
    +
    +      set.add(path)
    +      pathsToDelete.put(fs, set)
    +      true
    +    }
    +  }
    +
    +  private def cleanPathToDelete(): Unit = {
    +    // first delete the should delete special file
    +    for (fs <- pathsToDelete.keys) {
    +      for (path <- pathsToDelete(fs)) {
    +        try {
    +          if (!fs.delete(path, true)) {
    +            logWarning(s"Delete path ${path} fail at job commit time")
    +          }
    +        } catch {
    +          case ex: IOException =>
    +            throw new IOException(s"Unable to clear output " +
    +                s"file ${path} at job commit time", ex)
    --- End diff --
    
    recommend including ex.toString() in the new exception raised, as child exception text can often get lost


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90826 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90826/testReport)** for PR 21257 at commit [`6d496b1`](https://github.com/apache/spark/commit/6d496b1f6c7cbca2021c671c54d704760a2e7d7e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r189422931
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -163,6 +170,15 @@ class HadoopMapReduceCommitProtocol(
       }
     
       override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
    +    // first delete the should delete special file
    +    for (fs <- pathsToDelete.keys) {
    +      for (path <- pathsToDelete(fs)) {
    +        if (fs.exists(path)) {
    +          fs.delete(path, true)
    --- End diff --
    
    I think we should not delete the data when the task is aborted. The semantics of
    `descriptionWithJob` should be to delete the data when the `Job` is commited.
    I change code for handling exceptions.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r189719775
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala ---
    @@ -120,8 +120,9 @@ abstract class FileCommitProtocol {
        * Specifies that a file should be deleted with the commit of this job. The default
        * implementation deletes the file immediately.
        */
    -  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
    -    fs.delete(path, recursive)
    +  def deleteWithJob(fs: FileSystem, path: Path,
    +    canDeleteNow: Boolean = true): Boolean = {
    --- End diff --
    
    Nit: style issue. Can we follow the indents in the method declaration? https://github.com/databricks/scala-style-guide#spacing-and-indentation


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    some overall thought
    
    * I think this is only happening on a successful job commit, not abort. This is the desired action?
    * if something goes wrong here, is failing the entire job the correct action? If the deletes were happening earlier, then yes, the job would obviously fail. But now the core work has taken place, it's just cleanup failing. Which could be: permissions, transient network, etc. 
    
    I'll have to look a bit closer at what happens in committer cleanups right now, though as they are focused on rm -f $dest/__temporary/$jobAttempt, they are less worried about failures here as it shoudn't be changing any public datasets


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186608143
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    We choose to postpone deletion. Whether or not `output` is the same as `input`,
    now the `_temporary` directory is created in the `output` directory before deletion,
    so that it is not possible to delete the root directory directly.
    
    The original implementation was able to delete the root directory directly because it was deleted before the job was created, and then the root directory was rebuilt. Then the `_temporary` directory was created. Failure of any `task` in `job` in the original implementation will result in the loss of `output` data.
    
     I can't figure out how to separate the two situations. Do you have any good ideas?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186605405
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -235,4 +247,20 @@ class HadoopMapReduceCommitProtocol(
           tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
         }
       }
    +
    +  /**
    +   * now just record the file to be delete
    +   */
    +  override def deleteWithJob(fs: FileSystem,
    --- End diff --
    
    nit: code style. see https://github.com/apache/spark/pull/21257/files#diff-d97cfb5711116287a7655f32cd5675cbR132


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90413/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r187959698
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -163,6 +169,12 @@ class HadoopMapReduceCommitProtocol(
       }
     
       override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
    +    // first delete the should delete special file
    +    val committerFs = jobContext.getWorkingDirectory.getFileSystem(jobContext.getConfiguration)
    --- End diff --
    
    I change my code. 
    I now record every `FileSystem` will delete the path with a map structure. And Don't assume that they will use the same `FileSystem`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r187954805
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -235,4 +247,23 @@ class HadoopMapReduceCommitProtocol(
           tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
         }
       }
    +
    +  /**
    +   * now just record the file to be delete
    +   */
    +  override def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean,
    +    canDeleteNow: Boolean = true): Boolean = {
    +    if (canDeleteNow) {
    +      super.deleteWithJob(fs, path, recursive)
    +    } else {
    +      pathsToDelete.add(path -> recursive)
    +    }
    +  }
    +
    +  private def deletePath(fs: FileSystem, path: Path, recursive: Boolean): Unit = {
    +    if (fs.exists(path) && !fs.delete(path, recursive)) {
    +      throw new IOException(s"Unable to clear output " +
    +        s"directory $path")
    +    }
    --- End diff --
    
    I'd personally ignore a failure on delete(), as the conditions for the API call are "if this doesn't raise an exception then the dest is gone". You can skip the exists check as it will be superfluous


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] HadoopFsRelation cannot overwrite a path t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186927477
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    ok, then how about adding a new parameter `canDeleteNow: Boolean` to `FileCommitProtocol.deleteWithJob`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r187903129
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala ---
    @@ -120,7 +120,8 @@ abstract class FileCommitProtocol {
        * Specifies that a file should be deleted with the commit of this job. The default
        * implementation deletes the file immediately.
        */
    -  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
    +  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean,
    --- End diff --
    
    seems the `recursive` is always passed as true? can we remove it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90593 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90593/testReport)** for PR 21257 at commit [`273b1af`](https://github.com/apache/spark/commit/273b1afc96030355a4a289e9e070f141c6bcd406).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90593 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90593/testReport)** for PR 21257 at commit [`273b1af`](https://github.com/apache/spark/commit/273b1afc96030355a4a289e9e070f141c6bcd406).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90619/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90632/testReport)** for PR 21257 at commit [`6821795`](https://github.com/apache/spark/commit/68217952e25c2eef0064c433fe78e1a2240cb659).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Update the PR title? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r188186699
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala ---
    @@ -120,7 +120,8 @@ abstract class FileCommitProtocol {
        * Specifies that a file should be deleted with the commit of this job. The default
        * implementation deletes the file immediately.
        */
    -  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
    +  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean,
    --- End diff --
    
    Is there any (potential) cases we need a `recursive` parameter?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r187904340
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -163,6 +169,12 @@ class HadoopMapReduceCommitProtocol(
       }
     
       override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
    +    // first delete the should delete special file
    +    val committerFs = jobContext.getWorkingDirectory.getFileSystem(jobContext.getConfiguration)
    --- End diff --
    
    will this be different from `stagingDir.getFileSystem(jobContext.getConfiguration)`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90619 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90619/testReport)** for PR 21257 at commit [`6821795`](https://github.com/apache/spark/commit/68217952e25c2eef0064c433fe78e1a2240cb659).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #92165 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92165/testReport)** for PR 21257 at commit [`89599e6`](https://github.com/apache/spark/commit/89599e6bf6101e8731ba7c20641bd1cc254278b1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21257


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186915569
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    The key is that the data is already in the `output` directory before committing job, and we can't delete the `output` directory anymore.
    
    We overloaded `FileCommitProtocol` in the `HadoopMapReduceCommitProtocol` with the `deleteWithJob` method. Now it will not delete the file immediately, but it will wait until the entire job is committed.
    
    We did delete the files with committed the job, but the temporary output files were generated when the task was started. These temporary output files are in the `output` directory.  And the data will be move out to the `output` directory. 
    
    After the job starts, there is no safe time to delete the entire `output` directory.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90826/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r197174835
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +210,23 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +    if (fs.exists(staticPrefixPath)) {
    +      if (staticPartitionPrefix.isEmpty && outputCheck) {
    +        // input contain output, only delete output sub files when job commit
    +          val files = fs.listFiles(staticPrefixPath, false)
    +          while (files.hasNext) {
    +            val file = files.next()
    +            if (!committer.deleteWithJob(fs, file.getPath, false)) {
    +              throw new IOException(s"Unable to clear output " +
    --- End diff --
    
    as `committer.deleteWithJob()` returns true in base class, that check won't do much, at least not with the default impl. Probably better just to have `deleteWithJob()` return Unit, require callers to raise an exception on a delete failure. Given that delete() is required to say "dest doesn't exist if you return", I don't think they need to do any checks at all


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    cc @ericl 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186605552
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    > deleteMatchingPartitions happend only if overwrite is specified.
    
    I mean, we should only do it if overwriting the same table.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90957 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90957/testReport)** for PR 21257 at commit [`f4f329c`](https://github.com/apache/spark/commit/f4f329cbe8c5f5d9512cec91afa63d068eeddb45).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90826 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90826/testReport)** for PR 21257 at commit [`6d496b1`](https://github.com/apache/spark/commit/6d496b1f6c7cbca2021c671c54d704760a2e7d7e).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90636/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186600352
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    is it possible to only do it with overwrite? we should not introduce perf regression when not necessary.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186609102
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -235,4 +247,20 @@ class HadoopMapReduceCommitProtocol(
           tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
         }
       }
    +
    +  /**
    +   * now just record the file to be delete
    +   */
    +  override def deleteWithJob(fs: FileSystem,
    --- End diff --
    
    I have changed this code.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186772705
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    we will delete files just before committing job, do we?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    cc @rxin @cloud-fan 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186719888
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    If I do this, when the job is committed, it will delete the entire `output` directory. And there will be no data. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r187905010
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -898,4 +898,12 @@ object DDLUtils {
             "Cannot overwrite a path that is also being read from.")
         }
       }
    +
    +  def verifyReadPath(query: LogicalPlan, outputPath: Path): Boolean = {
    --- End diff --
    
    `isReadPath`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    cc @cloud-fan, Jenkins has some error, please help me retest, thanks


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90636/testReport)** for PR 21257 at commit [`803b0a0`](https://github.com/apache/spark/commit/803b0a087ba223f4a1d001a705e56fa94ae9f212).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r187953870
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -163,6 +169,12 @@ class HadoopMapReduceCommitProtocol(
       }
     
       override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
    +    // first delete the should delete special file
    +    val committerFs = jobContext.getWorkingDirectory.getFileSystem(jobContext.getConfiguration)
    --- End diff --
    
    I'm not sure you can guarantee that the working dir is always the dest FS. At least with @rdblue's committers, task attempts work dirs are in file:// & task commit (somehow) gets them to the destFS in a form where job commit will make them visible.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90413 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90413/testReport)** for PR 21257 at commit [`a51620b`](https://github.com/apache/spark/commit/a51620b7664ab75442ffded955a826006885cdee).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186603145
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    1. From the code point of view, the current implementation is `deleteMatchingPartitions` happend only if `overwrite` is specified.
    2. Using `dynamicPartitionOverwrite` will not solve this problem,because it will also generate a `.stage` directory under the table root directory. We still need to record all the files we want to delete, but we cannot directly delete the root directories.
    The dynamic partition overwrite is actually recording all the partitions that need to be deleted and then deleted one by one. And the entire table `overwrite` deletes all the data of the entire directory, it needs to record all deleted partition directory files,so in fact the implementation of the code is similar with `dynamicPartitionOverwrite` .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r187935923
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -163,6 +169,12 @@ class HadoopMapReduceCommitProtocol(
       }
     
       override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
    +    // first delete the should delete special file
    +    val committerFs = jobContext.getWorkingDirectory.getFileSystem(jobContext.getConfiguration)
    --- End diff --
    
    can we change other places in this method to use the `fs` created here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90593/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r197176461
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +210,23 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +    if (fs.exists(staticPrefixPath)) {
    +      if (staticPartitionPrefix.isEmpty && outputCheck) {
    +        // input contain output, only delete output sub files when job commit
    +          val files = fs.listFiles(staticPrefixPath, false)
    +          while (files.hasNext) {
    +            val file = files.next()
    +            if (!committer.deleteWithJob(fs, file.getPath, false)) {
    +              throw new IOException(s"Unable to clear output " +
    +                s"directory ${file.getPath} prior to writing to it")
    +            }
    +          }
    +      } else {
    +        if (!committer.deleteWithJob(fs, staticPrefixPath, true)) {
    +          throw new IOException(s"Unable to clear output " +
    --- End diff --
    
    again, hard to see how this exception path would be reached.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90400/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90400/testReport)** for PR 21257 at commit [`a51620b`](https://github.com/apache/spark/commit/a51620b7664ab75442ffded955a826006885cdee).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186411390
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. /table/foo=1)
         val staticPrefixPath = qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    this deletes leaf files one by one, have you evaluated the performance difference?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwrite a p...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21257
  
    **[Test build #90957 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90957/testReport)** for PR 21257 at commit [`f4f329c`](https://github.com/apache/spark/commit/f4f329cbe8c5f5d9512cec91afa63d068eeddb45).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org