You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/30 20:21:06 UTC

[GitHub] [spark] warrenzhu25 opened a new pull request, #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

warrenzhu25 opened a new pull request, #38852:
URL: https://github.com/apache/spark/pull/38852

   ### What changes were proposed in this pull request?
   Wait shuffle fetch to finish when decommission executor by checking num of opening streams.
   
   ### Why are the changes needed?
   Avoid fetch failed caused by pending shuffle fetches when executor self-exit after decommission
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added test
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1734644782

   @holdenk @dongjoon-hyun @Ngone51 Help take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1500375839

   @holdenk @dongjoon-hyun @Ngone51 Help take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a diff in pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on code in PR #38852:
URL: https://github.com/apache/spark/pull/38852#discussion_r1036661551


##########
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala:
##########
@@ -352,8 +353,20 @@ private[spark] class CoarseGrainedExecutorBackend(
                 // We can only trust allBlocksMigrated boolean value if there were no tasks running
                 // since the start of computing it.
                 if (allBlocksMigrated && (migrationTime > lastTaskRunningTime)) {
-                  logInfo("No running tasks, all blocks migrated, stopping.")
-                  exitExecutor(0, ExecutorLossMessage.decommissionFinished, notifyDriver = true)
+                  val pendingFetches = env.blockManager.getNumPendingBlockFetches()

Review Comment:
   Shouldn't the condition `executor.numRunningTasks == 0` avoid this situation already? If there are pending shuffle fetches, running tasks should be empty, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1367081140

   @holdenk @dongjoon-hyun @Ngone51 Help take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by "Ngone51 (via GitHub)" <gi...@apache.org>.
Ngone51 commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1736616570

   So with https://github.com/apache/spark/pull/42296, as long as all the shuffle data has been migrated, I think the fetcher would be able to retry the shuffle fetch after the decommissioned executor is gone.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1627596678

   > Just following up @warrenzhu25 do you have the time to add a unit test?
   
   Thanks for looking at this. Any ideas where I should add UT to cover which case? The change about `BlockManager` and `BlockTransferService` is pretty straightforward, the only place might need UT is in `CoarseGrainedExecutorBackend`, but  it seems hard to mock `BlockManager` there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1332905185

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1516849997

   > I like this addition conceptually 👍 Shuffle fetches should be fairly short in general as well. Any thoughts on adding automated testing?
   
   Thanks for feedback, I'll add UT to cover this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on a diff in pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on code in PR #38852:
URL: https://github.com/apache/spark/pull/38852#discussion_r1036663332


##########
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala:
##########
@@ -352,8 +353,20 @@ private[spark] class CoarseGrainedExecutorBackend(
                 // We can only trust allBlocksMigrated boolean value if there were no tasks running
                 // since the start of computing it.
                 if (allBlocksMigrated && (migrationTime > lastTaskRunningTime)) {
-                  logInfo("No running tasks, all blocks migrated, stopping.")
-                  exitExecutor(0, ExecutorLossMessage.decommissionFinished, notifyDriver = true)
+                  val pendingFetches = env.blockManager.getNumPendingBlockFetches()

Review Comment:
   No, these pending fetches means other executors are fetching from this executor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor [spark]

Posted by "holdenk (via GitHub)" <gi...@apache.org>.
holdenk commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1754108618

   Maybe the `NettyBlockTransferServiceSuite` would be enough so we can assert that we are keeping track of active streams and also once done that the count is zero (so we know this is not blocking shutdown indefinitely). WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1899427841

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by "holdenk (via GitHub)" <gi...@apache.org>.
holdenk commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1516843515

   I like this addition conceptually 👍 Shuffle fetches should be fairly short in general as well. Any thoughts on adding automated testing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1516821269

   @holdenk @dongjoon-hyun @Ngone51 Help take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by "holdenk (via GitHub)" <gi...@apache.org>.
holdenk commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1627542976

   Just following up @warrenzhu25 do you have the time to add a unit test?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #38852:
URL: https://github.com/apache/spark/pull/38852#issuecomment-1332773585

   @holdenk @dongjoon-hyun Help take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor
URL: https://github.com/apache/spark/pull/38852


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org