Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/18 11:28:38 UTC

[GitHub] [spark] symious opened a new pull request #35569: SPARK-38250. Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

symious opened a new pull request #35569:
URL: https://github.com/apache/spark/pull/35569


   ### What changes were proposed in this pull request?
   
   In `commitJob` and `abortJob` of `HadoopMapReduceCommitProtocol`, the stagingDir is deleted, but in cases like the following example the stagingDir was never created in the first place.
   
   If the underlying FileSystem is HDFS, no log is produced, but deleting a nonexistent file with `FileSystem.delete` still acquires the NameNode's WRITE lock; if such operations are frequent, they hurt NameNode performance.
   For other FileSystems, such as Alluxio, a warning is logged:
   
   AbstractFileSystem: delete failed: alluxio.exception.FileDoesNotExistException: Path "/test/spark/t10/.spark-staging-3cfe0d7c-3749-44de-9da7-f811a40aa4af" does not exist.
   
   This ticket adds an existence check before the actual delete operation, which eliminates the WARN log and avoids hurting HDFS performance.
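   The proposed guard can be sketched with a toy stand-in for the FileSystem API (`FakeFs` and the example path are illustrative only, not Spark code; in a real Hadoop FileSystem each call below would be an RPC to the NameNode):

   ```java
   import java.util.HashSet;
   import java.util.Set;

   // Minimal stand-in for the two Hadoop FileSystem calls involved.
   class FakeFs {
       private final Set<String> paths = new HashSet<>();
       boolean exists(String p) { return paths.contains(p); }
       boolean delete(String p) { return paths.remove(p); }
       void mkdirs(String p) { paths.add(p); }
   }

   public class CleanStagingDir {
       // The guarded delete proposed here: only issue the delete call when
       // the staging directory actually exists.
       static void cleanStagingDir(FakeFs fs, String stagingDir) {
           if (fs.exists(stagingDir)) {
               fs.delete(stagingDir);
           }
       }

       public static void main(String[] args) {
           FakeFs fs = new FakeFs();
           cleanStagingDir(fs, "/test/.spark-staging-xyz"); // never created: no delete issued
           fs.mkdirs("/test/.spark-staging-xyz");
           cleanStagingDir(fs, "/test/.spark-staging-xyz"); // created: deleted
           System.out.println(fs.exists("/test/.spark-staging-xyz")); // prints "false"
       }
   }
   ```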
   
   
   ### Why are the changes needed?
   
   The original behavior can hurt HDFS performance and produces WARN logs on Alluxio.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   
   ### How was this patch tested?
   
   Unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35569:
URL: https://github.com/apache/spark/pull/35569#issuecomment-1048478579


   cc @steveloughran FYI




[GitHub] [spark] steveloughran commented on a change in pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
steveloughran commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r813216159



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -305,4 +304,11 @@ class HadoopMapReduceCommitProtocol(
         logWarning(s"Exception while aborting ${taskContext.getTaskAttemptID}", e)
     }
   }
+
+  private def cleanStagingDir(jobContext: JobContext): Unit = {
+    val fs = stagingDir.getFileSystem(jobContext.getConfiguration)
+    if (fs.exists(stagingDir)) {

Review comment:
       please don't use delete on exit. that delete-on-shutdown operation is slow if called, and will delete what is there even if someone added a new file there after your operation. stick to using it for test teardown only.






[GitHub] [spark] AmplabJenkins commented on pull request #35569: SPARK-38250. Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35569:
URL: https://github.com/apache/spark/pull/35569#issuecomment-1046083352


   Can one of the admins verify this patch?




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r810943057



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -305,4 +304,11 @@ class HadoopMapReduceCommitProtocol(
         logWarning(s"Exception while aborting ${taskContext.getTaskAttemptID}", e)
     }
   }
+
+  private def cleanStagingDir(jobContext: JobContext): Unit = {
+    val fs = stagingDir.getFileSystem(jobContext.getConfiguration)
+    if (fs.exists(stagingDir)) {

Review comment:
       > This needs two RPC calls to remove the directory now, isn't it also a performance issue?
   > 
   > BTW, does `FileSystem` provide an API with "delete if exists" semantic?
   
    I think here we can use `deleteOnExit`, like `SaveAsHiveFile` does:
   ```
   fs.deleteOnExit(dirPath)
   ```
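    For context, the JVM-side analogue `java.io.File.deleteOnExit` behaves similarly: registration is deferred, and the actual removal happens at shutdown (as I understand it, Hadoop's `FileSystem.deleteOnExit` additionally tracks paths per FileSystem instance and processes them on `close`). An illustrative sketch:

    ```java
    import java.io.File;
    import java.io.IOException;

    public class DeleteOnExitDemo {
        // Registers a temp file for delete-on-exit and reports whether it
        // still exists immediately afterwards: deleteOnExit does NOT delete
        // now; the JVM removes the file during shutdown, and silently skips
        // paths that are already gone, which is why it tolerates a missing
        // file.
        static boolean stillExistsAfterRegistering() {
            try {
                File f = File.createTempFile("staging-", ".tmp");
                f.deleteOnExit();
                return f.exists();
            } catch (IOException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            System.out.println(stillExistsAfterRegistering()); // prints "true"
        }
    }
    ```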






[GitHub] [spark] symious commented on a change in pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
symious commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r810980252



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -305,4 +304,11 @@ class HadoopMapReduceCommitProtocol(
         logWarning(s"Exception while aborting ${taskContext.getTaskAttemptID}", e)
     }
   }
+
+  private def cleanStagingDir(jobContext: JobContext): Unit = {
+    val fs = stagingDir.getFileSystem(jobContext.getConfiguration)
+    if (fs.exists(stagingDir)) {

Review comment:
       I think the stagingDir is not created by `HadoopMapReduceCommitProtocol` or its subclasses; they just note the directory, and the actual `OutputCommitter` does the creation.
   
   Unless we are asking `HadoopMapReduceCommitProtocol` and `SQLHadoopMapReduceCommitProtocol` to create the stagingDir in the following situation:
   ```
      * The staging directory of this write job. Spark uses it to deal with files with absolute output
      * path, or writing data into partitioned directory with dynamicPartitionOverwrite=true.
   ```






[GitHub] [spark] symious commented on pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
symious commented on pull request #35569:
URL: https://github.com/apache/spark/pull/35569#issuecomment-1047055079


   @cloud-fan Changed to use `filesToMove` to check the absolute output paths from the TaskCommits.
   
   Since we cannot get the TaskCommits in `abortJob`, I added a check before deleting there.
   




[GitHub] [spark] symious commented on pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
symious commented on pull request #35569:
URL: https://github.com/apache/spark/pull/35569#issuecomment-1046753354


   @cloud-fan @AngersZhuuuu 
   Updated to use a variable to indicate the existence of the stagingDir; could you help check?




[GitHub] [spark] symious commented on a change in pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
symious commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r812540029



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -157,6 +159,7 @@ class HadoopMapReduceCommitProtocol(
     // Include a UUID here to prevent file collisions for one task writing to different dirs.
     // In principle we could include hash(absoluteDir) instead but this is simpler.
     val tmpOutputPath = new Path(stagingDir, UUID.randomUUID().toString() + "-" + filename).toString
+    stagingDirExists = true

Review comment:
       Updated the check in `commitJob`, but since the result can't be retrieved in `abortJob`, the check-before-delete approach is implemented in `abortJob`.
   @cloud-fan Could you help check?






[GitHub] [spark] symious commented on pull request #35569: SPARK-38250. Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
symious commented on pull request #35569:
URL: https://github.com/apache/spark/pull/35569#issuecomment-1044576655


   @cloud-fan Could you help to review this PR?




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r810972568



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -305,4 +304,11 @@ class HadoopMapReduceCommitProtocol(
         logWarning(s"Exception while aborting ${taskContext.getTaskAttemptID}", e)
     }
   }
+
+  private def cleanStagingDir(jobContext: JobContext): Unit = {
+    val fs = stagingDir.getFileSystem(jobContext.getConfiguration)
+    if (fs.exists(stagingDir)) {

Review comment:
       > One thing about this method is it's used when creating the path, notifying the FileSystem to delete when FileSystem is closed. A little different from the current use of `fs.delete(stagingDir)`.
   
   I mean we can do as `SaveAsHiveFile` does: mkdir first and let it be deleted when the system exits, since `deleteOnExit` checks file existence when deleting.






[GitHub] [spark] steveloughran edited a comment on pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
steveloughran edited a comment on pull request #35569:
URL: https://github.com/apache/spark/pull/35569#issuecomment-1049106381


   * alluxio shouldn't be complaining that the file isn't there. delete(path) must return true as the requirement "path is not present when we return" is met.
   * removing checks before delete() saves one round trip when working with object stores.
   * valid point about namenode lock overheads, but not something i personally worry too much about. a lock of some form may be needed for the exists probe too, and you've now got two RPCs. if the situation was that most times you did the call the path you wanted to delete wasn't there then maybe it could be justified, otherwise it adds 1 call per operation.
   
   overall then, -1 to this, though it does sound like alluxio is overreacting.
   
   >  why doesn't fs.delete check the existence on the server side, I think the following ideas might be related.
   
   it doesn't consider deleting a nonexistent path to be an error. after you finish the call, the path you passed in isn't there, which is the outcome it is trying to offer.  
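   The delete contract described above has a close analogue in `java.nio` (illustrative code, not the Hadoop API; the path name below is just assumed to be absent):

   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Paths;

   public class DeleteSemantics {
       // Deleting a path that is absent is not treated as an error here
       // either: deleteIfExists reports false instead of throwing, because
       // the desired outcome -- "path is not present when we return" -- is
       // already met.
       static boolean deleteMissing(String path) {
           try {
               return Files.deleteIfExists(Paths.get(path));
           } catch (IOException e) {
               throw new RuntimeException(e);
           }
       }

       public static void main(String[] args) {
           // Expected to print "false" on any machine where this path is absent.
           System.out.println(deleteMissing("/tmp/spark-staging-definitely-missing-2f7a91"));
       }
   }
   ```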
   
   > Similarly, for users of FileSystem, maybe some FileSystems do check the existence before deleting like Alluxio, but, IMHO, we can't ask all the FileSystem to do the same, it's better to do the check on users' side.
   
   why?
    
   
   




[GitHub] [spark] symious commented on pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
symious commented on pull request #35569:
URL: https://github.com/apache/spark/pull/35569#issuecomment-1049345319


   @steveloughran Thanks for the review.
   
   I do agree about the overhead incurred by the added existence-check RPC. So the idea in the latest version of the commit is:
    
   - `commitJob`: delete the directory only when we are sure it exists, i.e. `if (dynamicPartitionOverwrite || filesToMove.nonEmpty)`, so we don't need to do the check when the job runs successfully.
   - `abortJob`: here we cannot tell whether we created the directory, and `abortJob` shouldn't be invoked too frequently, so the existence check before deleting is used.
   
   > Similarly, for users of FileSystem, maybe some FileSystems do check the existence before deleting like Alluxio, but, IMHO, we can't ask all the FileSystem to do the same, it's better to do the check on users' side.
   > 
   > why?
   
   In fact, this ticket was prompted by the warning log from Alluxio. Since we are migrating some jobs to Alluxio, when I saw the warning log I assumed there might be something wrong with Alluxio, as HDFS never produced logs like this. Even after checking the Spark code, I still thought the problem came from Alluxio. After deeper debugging, I found Alluxio was reporting the correct thing, although HDFS does return `false` to users.
   
   I think the Boolean value returned by `fs.delete` is a little ambiguous: `false` can mean we tried to delete a nonexistent file, or a file outside our permissions, or something else. Not to mention that we don't even process the result of this delete.
   
   Maybe it's the absence of warnings from HDFS, plus the unhandled deletion result, that gives the impression that the directory always exists and that Alluxio is reporting something incorrect. So I think the point has shifted from overhead and warning logs to how misleading the delete is: the bare `fs.delete(stagingDir, true)` gives the incorrect impression that the stagingDir is always there.
   
   I even think the existence check in `abortJob` is not that important if we add a comment like "we are not sure whether the stagingDir exists, so we try our best to delete it". Only in `commitJob` can we decide whether the stagingDir exists, so when it's not there we don't have to do the deletion, and readers then know the stagingDir isn't necessarily a generated directory here.




[GitHub] [spark] symious commented on a change in pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
symious commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r810921597



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -305,4 +304,11 @@ class HadoopMapReduceCommitProtocol(
         logWarning(s"Exception while aborting ${taskContext.getTaskAttemptID}", e)
     }
   }
+
+  private def cleanStagingDir(jobContext: JobContext): Unit = {
+    val fs = stagingDir.getFileSystem(jobContext.getConfiguration)
+    if (fs.exists(stagingDir)) {

Review comment:
       @cloud-fan Thanks for the review. 
   The overhead of the newly added read RPC is quite small compared to the unnecessary deletion of a nonexistent file, since in the NameNode the read lock is shared while the write lock is exclusive.
   
   And I think checking existence first is more of a client-side design; when we run `hadoop fs -rm hdfs://ns/file`, the existence is checked before the real delete too.
   
   In fact, I think it may be better to add a variable indicating whether the stagingDir was created, so that we don't need the first RPC to check its existence. @cloud-fan What do you think?
   
   
   
   






[GitHub] [spark] cloud-fan commented on a change in pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r811107313



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -157,6 +159,7 @@ class HadoopMapReduceCommitProtocol(
     // Include a UUID here to prevent file collisions for one task writing to different dirs.
     // In principle we could include hash(absoluteDir) instead but this is simpler.
     val tmpOutputPath = new Path(stagingDir, UUID.randomUUID().toString() + "-" + filename).toString
+    stagingDirExists = true

Review comment:
       This is called at the executor side. How does it work with the checking code in `commitJob` and `abortJob`?






[GitHub] [spark] symious commented on a change in pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
symious commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r813039532



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -236,7 +236,9 @@ class HadoopMapReduceCommitProtocol(
         }
       }
 
-      fs.delete(stagingDir, true)
+      if (dynamicPartitionOverwrite || filesToMove.nonEmpty) {
+        fs.delete(stagingDir, true)

Review comment:
       Thanks for the reply. I'll try to consult the HDFS community about whether there's an official document on this "check before delete" practice.
   
   In my current scenario, the cons are:
   1. there will be a warning log when writing to Alluxio;
   2. since HDFS returns no log, the user won't even notice that he's deleting a nonexistent file, which may lead to other mistakes, say when I want to check the size of ".spark-staging-xxx" or something else, but the directory doesn't exist at all;
   3. the performance overhead mentioned above from the NameNode write lock.
   
   As for the question of why `fs.delete` doesn't check existence on the server side, I think the following ideas are relevant:
   1. FileSystem keeps a succinct interface, so `fs.delete` only does deletion.
   2. The HDFS client and other clients already do the check; e.g. when you run `hadoop fs -rmr hdfs://xxx/xxx`, it checks first before actually calling `fs.delete`. So if another `fs.exists` were added inside `fs.delete`, clients that already check would pay 2 * `fs.exists` + 1 * `fs.delete` per removal.
   3. It reminds me of the story of the king who, afraid of dirty feet, asked for all the ground in the country to be swept, when in fact the only thing he needed was to wear shoes. Similarly, maybe some FileSystems do check existence before deleting, like Alluxio, but IMHO we can't ask every FileSystem to do the same; it's better to do the check on the users' side.
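   The round-trip arithmetic in point 2 can be made concrete with a toy call counter (illustrative code, not the Hadoop client; `CountingFs` is a hypothetical stand-in):

   ```java
   import java.util.HashSet;
   import java.util.Set;

   // Counts "RPCs": each exists/delete call is one simulated round trip.
   class CountingFs {
       int calls = 0;
       private final Set<String> paths = new HashSet<>();
       boolean exists(String p) { calls++; return paths.contains(p); }
       boolean delete(String p) { calls++; return paths.remove(p); }
       void mkdirs(String p) { paths.add(p); }
   }

   public class RpcCost {
       // Client-side "check before delete", as `hadoop fs -rm` does.
       // Returns the number of round trips this removal cost.
       static int removeWithProbe(CountingFs fs, String p) {
           int before = fs.calls;
           if (fs.exists(p)) {   // round trip 1: existence probe
               fs.delete(p);     // round trip 2: actual delete
           }
           return fs.calls - before;
       }

       public static void main(String[] args) {
           CountingFs fs = new CountingFs();
           fs.mkdirs("/data/f");
           System.out.println(removeWithProbe(fs, "/data/f")); // prints 2: path present
           System.out.println(removeWithProbe(fs, "/data/f")); // prints 1: path absent
       }
   }
   ```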






[GitHub] [spark] steveloughran commented on pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #35569:
URL: https://github.com/apache/spark/pull/35569#issuecomment-1049106381


   * alluxio shouldn't be complaining that the file isn't there. delete(path) must return true as the requirement "path is not present when we return" is met.
   * removing checks before delete() saves one round trip when working with object stores.
   * valid point about namenode lock overheads, but not something i personally worry too much about. a lock of some form may be needed for the exists probe too, and you've now got two RPCs.




[GitHub] [spark] cloud-fan commented on a change in pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r810889474



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -305,4 +304,11 @@ class HadoopMapReduceCommitProtocol(
         logWarning(s"Exception while aborting ${taskContext.getTaskAttemptID}", e)
     }
   }
+
+  private def cleanStagingDir(jobContext: JobContext): Unit = {
+    val fs = stagingDir.getFileSystem(jobContext.getConfiguration)
+    if (fs.exists(stagingDir)) {

Review comment:
       This needs two RPC calls to remove the directory now, isn't it also a performance issue?
   
   BTW, does `FileSystem` provide an API with "delete if exists" semantic?






[GitHub] [spark] symious commented on a change in pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
symious commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r810955956



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -305,4 +304,11 @@ class HadoopMapReduceCommitProtocol(
         logWarning(s"Exception while aborting ${taskContext.getTaskAttemptID}", e)
     }
   }
+
+  private def cleanStagingDir(jobContext: JobContext): Unit = {
+    val fs = stagingDir.getFileSystem(jobContext.getConfiguration)
+    if (fs.exists(stagingDir)) {

Review comment:
       One thing about this method: it's used when creating the path, telling the FileSystem to delete it when the FileSystem is closed, which is a little different from the current use of `fs.delete(stagingDir)`.








[GitHub] [spark] cloud-fan commented on a change in pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r812969927



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -236,7 +236,9 @@ class HadoopMapReduceCommitProtocol(
         }
       }
 
-      fs.delete(stagingDir, true)
+      if (dynamicPartitionOverwrite || filesToMove.nonEmpty) {
+        fs.delete(stagingDir, true)

Review comment:
       Is there any document describing the best practice for `fs.delete`? If "check before delete" is a best practice, I think we should just follow it. (Another question is why `fs.delete` does not check existence by itself inside its implementation...)
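       Per the PR description, HDFS's `fs.delete` on a missing path does not fail, but it still takes the NameNode write lock, which is the cost the guard avoids. A local-filesystem sketch of the check-before-delete guard (Python stand-in for the Hadoop API, hypothetical helper name):

       ```python
       import shutil
       from pathlib import Path

       def delete_staging_dir(staging_dir):
           """Recursively delete the staging dir only if it exists, mirroring
           an `if (fs.exists(stagingDir)) fs.delete(stagingDir, true)` guard.
           Returns True when something was actually deleted."""
           staging_dir = Path(staging_dir)
           if staging_dir.exists():
               shutil.rmtree(staging_dir)
               return True
           # Skip the delete call entirely for a dir that was never created.
           return False
       ```

       Note the tradeoff raised later in the thread: when the directory does exist, an existence probe costs an extra round trip, which is why the diff above gates the delete on job-side state (`dynamicPartitionOverwrite || filesToMove.nonEmpty`) rather than an `fs.exists` call.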







[GitHub] [spark] steveloughran commented on pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #35569:
URL: https://github.com/apache/spark/pull/35569#issuecomment-1070807232


   I've been doing some benchmarking of cloud storage, and FWIW Google GCS is pretty sluggish on operations like delete and mkdir even when they are no-ops. Maybe we should just accept that the cost of the check is a bit higher on some stores than on others?



