Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/29 05:55:31 UTC

[GitHub] [spark] xinglin opened a new pull request, #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

xinglin opened a new pull request, #38832:
URL: https://github.com/apache/spark/pull/38832

   
   ### What changes were proposed in this pull request?
   
   This PR combines the fixes for SPARK-3900 and SPARK-21138. SPARK-3900 fixed an `IllegalStateException` thrown when creating the FileSystem object in `cleanupStagingDir`. However, SPARK-21138 reverted that change while addressing a filesystem authority mismatch ("Wrong FS") exception. We need both fixes.
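
The "Wrong FS" side of the description can be illustrated with a small JDK-only model. This is a sketch under stated assumptions, not Hadoop code: `BoundFs`, `accepts`, `rootOf`, the host names, and the staging path are all hypothetical. It only shows why a handle resolved from the staging path itself cannot mismatch that path's authority, while a handle bound to some other authority can.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical model; BoundFs is a stand-in, not Hadoop's FileSystem.
final class WrongFsSketch {

    /** A file system handle bound to one scheme and authority. */
    static final class BoundFs {
        private final URI root;

        BoundFs(URI root) {
            this.root = root;
        }

        /** Rejects paths that belong to a different scheme or authority. */
        void checkPath(URI path) {
            boolean sameScheme = Objects.equals(root.getScheme(), path.getScheme());
            boolean sameAuthority = Objects.equals(root.getAuthority(), path.getAuthority());
            if (!sameScheme || !sameAuthority) {
                throw new IllegalArgumentException("Wrong FS: " + path + ", expected: " + root);
            }
        }
    }

    /** Derives the scheme and authority from the path itself. */
    static String rootOf(String path) {
        URI uri = URI.create(path);
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    /** True when a handle rooted at fsRoot accepts the given path. */
    static boolean accepts(String fsRoot, String path) {
        try {
            new BoundFs(URI.create(fsRoot)).checkPath(URI.create(path));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Illustrative staging directory on a non-default name node.
        String staging = "hdfs://nn2:8020/user/alice/.sparkStaging/app_1";
        System.out.println(accepts("hdfs://nn1:8020", staging));  // handle for another authority
        System.out.println(accepts(rootOf(staging), staging));    // handle derived from the path
    }
}
```

In this model the first call is rejected and the second is accepted, which mirrors the reason for obtaining the file system from the staging path rather than from a default.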
   
   ### Why are the changes needed?
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinglin commented on a diff in pull request #38832: [WIP] SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
xinglin commented on code in PR #38832:
URL: https://github.com/apache/spark/pull/38832#discussion_r1036211437


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
##########
@@ -240,6 +240,9 @@ private[spark] class ApplicationMaster(
 
       logInfo("ApplicationAttemptId: " + appAttemptId)
 
+      // During shutdown, we may not be able to create an FileSystem object. So, pre-create here.
+      val stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
+      val stagingDirFs = stagingDirPath.getFileSystem(yarnConf)

Review Comment:
   Hi @srowen,
   
   Let me take a look at addressing the possible leak. Ideally, when ShutdownHookManager executes each shutdown hook, it should automatically remove its reference to that hook. But that does not seem to be the case right now.
   
   I am marking this PR as WIP until we address, or have a proposal for, this concern.





[GitHub] [spark] xkrogen commented on pull request #38832: [WIP] SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
xkrogen commented on PR #38832:
URL: https://github.com/apache/spark/pull/38832#issuecomment-1335943553

   @xinglin, can you please enable GitHub Actions on your fork? The CI can't run until you do that (see [this](https://github.com/apache/spark/pull/38832/checks?check_run_id=9762733604) and the "Pull Request" section of the [contribution guide](https://spark.apache.org/contributing.html)).
   
   +1 on this change from me, and nice find that this fix already existed but was (probably accidentally) reverted.
   
   cc @mridulm can you take a look?




[GitHub] [spark] xinglin commented on a diff in pull request #38832: [WIP] SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
xinglin commented on code in PR #38832:
URL: https://github.com/apache/spark/pull/38832#discussion_r1038620628


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
##########
@@ -240,6 +240,9 @@ private[spark] class ApplicationMaster(
 
       logInfo("ApplicationAttemptId: " + appAttemptId)
 
+      // During shutdown, we may not be able to create an FileSystem object. So, pre-create here.
+      val stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
+      val stagingDirFs = stagingDirPath.getFileSystem(yarnConf)

Review Comment:
   The shutdown hook holds a reference to the filesystem object, and Java won't GC/free an object until there are no more references to it, right?
   
   `ShutdownHookManager.java` adds all hooks to a Set (`hooks`). But when `executeShutdown()` runs each hook, it does not remove that hook from the Set, so the references should still be stored in `hooks`. To remove them, I think we should either add `hooks.clear()` at the end of `executeShutdown()` or remove the hook from Spark.
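
The retention described in this comment can be modeled in a few lines. This is a hypothetical sketch that follows the comment's description, not Hadoop's `ShutdownHookManager` source: `RetainingHookManager`, `retainedHooks`, and `demo` are made-up names, and a `StringBuilder` stands in for the filesystem object.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the behavior described in the comment above;
// this is not Hadoop's ShutdownHookManager.
final class HookRetentionSketch {

    static final class RetainingHookManager {
        private final Set<Runnable> hooks = new LinkedHashSet<>();

        void addHook(Runnable hook) {
            hooks.add(hook);
        }

        /** Runs each hook but, as described, leaves it in the set. */
        void executeShutdown() {
            for (Runnable hook : hooks) {
                hook.run();
            }
        }

        /** The alternative floated in the comment: drop all references. */
        void clear() {
            hooks.clear();
        }

        int retainedHooks() {
            return hooks.size();
        }
    }

    /** Returns {hooks retained after execution, hooks retained after clear()}. */
    static int[] demo() {
        RetainingHookManager manager = new RetainingHookManager();
        StringBuilder fileSystemStandIn = new StringBuilder("fs");
        // The hook captures the stand-in, so the manager reaches it through the hook.
        manager.addHook(() -> fileSystemStandIn.append(":cleaned"));
        manager.executeShutdown();
        int afterExecute = manager.retainedHooks();
        manager.clear();
        int afterClear = manager.retainedHooks();
        return new int[] {afterExecute, afterClear};
    }

    public static void main(String[] args) {
        int[] counts = demo();
        System.out.println("retained after executeShutdown(): " + counts[0]);
        System.out.println("retained after clear():           " + counts[1]);
    }
}
```

In this model the hook, and through it the captured object, stays reachable after execution until `clear()` is called. Whether that matters in practice depends on whether the process keeps running once the hooks have finished.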





[GitHub] [spark] xinglin commented on a diff in pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
xinglin commented on code in PR #38832:
URL: https://github.com/apache/spark/pull/38832#discussion_r1035099293


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
##########
@@ -240,6 +240,9 @@ private[spark] class ApplicationMaster(
 
       logInfo("ApplicationAttemptId: " + appAttemptId)
 
+      // During shutdown, we may not be able to create an FileSystem object. So, pre-create here.
+      val stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
+      val stagingDirFs = stagingDirPath.getFileSystem(yarnConf)

Review Comment:
   The problem is that we cannot create a new filesystem object (this line: `val stagingDirFs = stagingDirPath.getFileSystem(yarnConf)`) if we are in shutdown and this is the first time we try to create one, because creating it registers a shutdown hook and we cannot register a shutdown hook during a shutdown. That is why we should move the filesystem object creation out of `cleanupStagingDir` itself, so that `cleanupStagingDir`, when called during shutdown, won't try to create a filesystem object.
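
The ordering problem in this comment can be reproduced with a simplified JDK-only model. Everything here is an assumption for illustration: `HookRegistry` and `CachedResource` are hypothetical stand-ins rather than Hadoop's `ShutdownHookManager` or `FileSystem`, and the only behavior borrowed from the real world is that registering a hook once shutdown has begun fails with an `IllegalStateException` (which is also what the JDK's `Runtime.addShutdownHook` does).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only: HookRegistry and CachedResource are hypothetical
// stand-ins, not Hadoop's ShutdownHookManager or FileSystem.
final class ShutdownOrderingSketch {

    /** Accepts hooks until shutdown starts, then rejects new registrations. */
    static final class HookRegistry {
        private final List<Runnable> hooks = new ArrayList<>();
        private boolean shuttingDown = false;

        void addHook(Runnable hook) {
            if (shuttingDown) {
                throw new IllegalStateException("Shutdown in progress, cannot add a hook");
            }
            hooks.add(hook);
        }

        /** Runs every registered hook; returns the failures instead of rethrowing. */
        List<RuntimeException> runShutdown() {
            shuttingDown = true;
            List<RuntimeException> failures = new ArrayList<>();
            for (Runnable hook : new ArrayList<>(hooks)) {
                try {
                    hook.run();
                } catch (RuntimeException e) {
                    failures.add(e);
                }
            }
            return failures;
        }
    }

    /** A resource whose creation registers its own cleanup hook. */
    static final class CachedResource {
        CachedResource(HookRegistry registry) {
            registry.addHook(() -> { /* close the resource */ });
        }

        void deleteStagingDir() { /* placeholder for the real cleanup */ }
    }

    /** Lazy variant: the resource is first created inside the cleanup hook. */
    static List<RuntimeException> lazyCreation() {
        HookRegistry registry = new HookRegistry();
        registry.addHook(() -> new CachedResource(registry).deleteStagingDir());
        return registry.runShutdown();
    }

    /** Eager variant: the resource is created up front and captured by the hook. */
    static List<RuntimeException> eagerCreation() {
        HookRegistry registry = new HookRegistry();
        CachedResource resource = new CachedResource(registry);
        registry.addHook(resource::deleteStagingDir);
        return registry.runShutdown();
    }

    public static void main(String[] args) {
        System.out.println("lazy failures:  " + lazyCreation().size());
        System.out.println("eager failures: " + eagerCreation().size());
    }
}
```

In this model the lazy variant records one `IllegalStateException` and the eager variant records none, which is the motivation for pre-creating the filesystem object before the cleanup hook can run.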





[GitHub] [spark] srowen closed pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
srowen closed pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138
URL: https://github.com/apache/spark/pull/38832




[GitHub] [spark] srowen commented on a diff in pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #38832:
URL: https://github.com/apache/spark/pull/38832#discussion_r1035109693


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
##########
@@ -240,6 +240,9 @@ private[spark] class ApplicationMaster(
 
       logInfo("ApplicationAttemptId: " + appAttemptId)
 
+      // During shutdown, we may not be able to create an FileSystem object. So, pre-create here.
+      val stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
+      val stagingDirFs = stagingDirPath.getFileSystem(yarnConf)

Review Comment:
   Oh I see, it's in a hook. Is it safe to hold on to the FS object like this in the hook? Maybe so; it just makes me slightly uneasy about leaks.





[GitHub] [spark] xinglin commented on pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
xinglin commented on PR #38832:
URL: https://github.com/apache/spark/pull/38832#issuecomment-1331073945

   > Those are all long-since fixed. This description doesn't make sense
   
   Updated the description to share more details. Essentially, the fix introduced in SPARK-3900 was reverted in SPARK-21138, and we need to bring it back.
   




[GitHub] [spark] xinglin commented on pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
xinglin commented on PR #38832:
URL: https://github.com/apache/spark/pull/38832#issuecomment-1331079870

   Hi @xkrogen,
   
   Please review this PR. Thanks.




[GitHub] [spark] xinglin commented on a diff in pull request #38832: [WIP] SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
xinglin commented on code in PR #38832:
URL: https://github.com/apache/spark/pull/38832#discussion_r1038629936


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
##########
@@ -240,6 +240,9 @@ private[spark] class ApplicationMaster(
 
       logInfo("ApplicationAttemptId: " + appAttemptId)
 
+      // During shutdown, we may not be able to create an FileSystem object. So, pre-create here.
+      val stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
+      val stagingDirFs = stagingDirPath.getFileSystem(yarnConf)

Review Comment:
   Got it. Then I guess there is no concern here. Thanks.
   
   I guess this PR is ready to be merged then.





[GitHub] [spark] srowen commented on a diff in pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #38832:
URL: https://github.com/apache/spark/pull/38832#discussion_r1035091858


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
##########
@@ -240,6 +240,9 @@ private[spark] class ApplicationMaster(
 
       logInfo("ApplicationAttemptId: " + appAttemptId)
 
+      // During shutdown, we may not be able to create an FileSystem object. So, pre-create here.
+      val stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
+      val stagingDirFs = stagingDirPath.getFileSystem(yarnConf)

Review Comment:
   I'm having trouble seeing how the logic is different before and after.  In both cases, the FS is obtained from `new Path(System.getenv("SPARK_YARN_STAGING_DIR"))` in the same way, no?





[GitHub] [spark] xkrogen commented on a diff in pull request #38832: [WIP] SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
xkrogen commented on code in PR #38832:
URL: https://github.com/apache/spark/pull/38832#discussion_r1038623079


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
##########
@@ -240,6 +240,9 @@ private[spark] class ApplicationMaster(
 
       logInfo("ApplicationAttemptId: " + appAttemptId)
 
+      // During shutdown, we may not be able to create an FileSystem object. So, pre-create here.
+      val stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
+      val stagingDirFs = stagingDirPath.getFileSystem(yarnConf)

Review Comment:
   After shutdown hooks complete, the JVM exits. There is no need to perform GC.





[GitHub] [spark] xkrogen commented on a diff in pull request #38832: [WIP] SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
xkrogen commented on code in PR #38832:
URL: https://github.com/apache/spark/pull/38832#discussion_r1038612370


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
##########
@@ -240,6 +240,9 @@ private[spark] class ApplicationMaster(
 
       logInfo("ApplicationAttemptId: " + appAttemptId)
 
+      // During shutdown, we may not be able to create an FileSystem object. So, pre-create here.
+      val stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
+      val stagingDirFs = stagingDirPath.getFileSystem(yarnConf)

Review Comment:
   > Is it safe to hold on to the FS object like this in the hook?
   
   I don't see why this would be an issue. It adds one additional reference to the state required for the hook, but this is negligible. FS objects are already globally cached, so in a typical case there is no new/additional FS instance being held onto by the hook; just a reference to the same FS object which would be used elsewhere throughout the driver. If `SPARK_YARN_STAGING_DIR` happens to be on a different file system than everything else, you'll end up with an additional FS reference, but most FS implementations (including HDFS) initialize all of their external connections lazily (at the time of the first RPC), so you're just talking about a small amount of in-memory state.
   
   > But that does not seem to be the case right now.
   
   Can you elaborate @xinglin?
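
The caching argument in the first paragraph of this comment can be sketched with a JDK-only model. This is an illustration under assumptions, not Hadoop's cache: `FsHandle`, `FsCache`, and `sharesHandle` are hypothetical names, the URIs are made up, and the key is simplified to scheme plus authority.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model: FsHandle and FsCache are stand-ins, not Hadoop classes.
final class FsCacheSketch {

    static final class FsHandle {
        final String key;

        FsHandle(String key) {
            this.key = key;
        }
    }

    static final class FsCache {
        private final Map<String, FsHandle> cache = new ConcurrentHashMap<>();

        /** Returns one shared handle per scheme and authority. */
        FsHandle get(URI uri) {
            String key = uri.getScheme() + "://" + uri.getAuthority();
            return cache.computeIfAbsent(key, FsHandle::new);
        }

        int size() {
            return cache.size();
        }
    }

    /** True when both URIs resolve to the same cached handle. */
    static boolean sharesHandle(String first, String second) {
        FsCache cache = new FsCache();
        return cache.get(URI.create(first)) == cache.get(URI.create(second));
    }

    public static void main(String[] args) {
        FsCache cache = new FsCache();
        FsHandle jobFs = cache.get(URI.create("hdfs://nn1:8020/user/alice/data"));
        FsHandle stagingFs = cache.get(URI.create("hdfs://nn1:8020/user/alice/.sparkStaging/app_1"));
        FsHandle otherFs = cache.get(URI.create("hdfs://nn2:8020/staging"));
        System.out.println("same file system shares a handle: " + (jobFs == stagingFs));
        System.out.println("different authority shares a handle: " + (jobFs == otherFs));
        System.out.println("cached handles: " + cache.size());
    }
}
```

When the staging directory lives on the same file system as the rest of the job, the hook in this model holds only another reference to an already cached handle; a second handle appears only for a different authority.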





[GitHub] [spark] srowen commented on pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38832:
URL: https://github.com/apache/spark/pull/38832#issuecomment-1335979262

   I think it's OK. I don't know enough to evaluate the logic of the change, but it seems plausible to me.




[GitHub] [spark] xkrogen commented on a diff in pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
xkrogen commented on code in PR #38832:
URL: https://github.com/apache/spark/pull/38832#discussion_r1038642590


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
##########
@@ -240,6 +240,9 @@ private[spark] class ApplicationMaster(
 
       logInfo("ApplicationAttemptId: " + appAttemptId)
 
+      // During shutdown, we may not be able to create an FileSystem object. So, pre-create here.
+      val stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
+      val stagingDirFs = stagingDirPath.getFileSystem(yarnConf)

Review Comment:
   @srowen does this discussion address any concerns you had?





[GitHub] [spark] srowen commented on pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38832:
URL: https://github.com/apache/spark/pull/38832#issuecomment-1336425789

   Merged to master




[GitHub] [spark] srowen commented on pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38832:
URL: https://github.com/apache/spark/pull/38832#issuecomment-1330672720

   Those are all long-since fixed. This description doesn't make sense




[GitHub] [spark] AmplabJenkins commented on pull request #38832: SPARK-41313 Combine fixes for SPARK-3900 and SPARK-21138

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38832:
URL: https://github.com/apache/spark/pull/38832#issuecomment-1330776357

   Can one of the admins verify this patch?

