You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "warrenzhu25 (via GitHub)" <gi...@apache.org> on 2023/07/09 04:42:05 UTC

[GitHub] [spark] warrenzhu25 opened a new pull request, #41905: [SPARK-44126][CORE] Shuffle migration failure count should not increase when target executor decommissioned

warrenzhu25 opened a new pull request, #41905:
URL: https://github.com/apache/spark/pull/41905

   ### What changes were proposed in this pull request?
   Do not increase shuffle migration failure count when target executor decommissioned
   
   
   ### Why are the changes needed?
   Block manager decommissioner only sync with block manager master about live peers every . If some block manager decommissioned between this, it still try to migrated shuffle to such decommissioned block manger. The migration will be failed with RuntimeException("BlockSavedOnDecommissionedBlockManagerException"). Detailed stack trace as below:
   
   ```
   org.apache.spark.SparkException: Exception thrown in awaitResult:
       at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
       at org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
       at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5(BlockManagerDecommissioner.scala:127)
       at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5$adapted(BlockManagerDecommissioner.scala:118)
       at scala.collection.immutable.List.foreach(List.scala:431)
       at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:118)    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       at java.base/java.lang.Thread.run(Thread.java:829)
   Caused by: java.lang.RuntimeException: org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block shuffle_2_6429_0.data cannot be saved on decommissioned executor
       at org.apache.spark.errors.SparkCoreErrors$.cannotSaveBlockOnDecommissionedExecutorError(SparkCoreErrors.scala:238)    at org.apache.spark.storage.BlockManager.checkShouldStore(BlockManager.scala:277)
       at org.apache.spark.storage.BlockManager.putBlockDataAsStream(BlockManager.scala:741)
       at org.apache.spark.network.netty.NettyBlockRpcServer.receiveStream(NettyBlockRpcServer.scala:174)
   
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added UT in `BlockManagerDecommissionUnitSuite`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #41905: [SPARK-44126][CORE] Shuffle migration failure count should not increase when target executor decommissioned

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #41905:
URL: https://github.com/apache/spark/pull/41905#issuecomment-1732381545

   @Ngone51 @dongjoon-hyun Could you help merge this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #41905: [SPARK-44126][CORE] Shuffle migration failure count should not increase when target executor decommissioned

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #41905:
URL: https://github.com/apache/spark/pull/41905#issuecomment-1627632433

   Will let @holdenk or @dongjoon-hyun review this ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #41905: [SPARK-44126][CORE] Shuffle migration failure count should not increase when target executor decommissioned

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41905:
URL: https://github.com/apache/spark/pull/41905#issuecomment-1628325381

   cc @Ngone51 and @jiangxb1987 too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #41905: [SPARK-44126][CORE] Shuffle migration failure count should not increase when target executor decommissioned

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #41905:
URL: https://github.com/apache/spark/pull/41905#issuecomment-1627603374

   cc @dongjoon-hyun @holdenk @mridulm 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jiangxb1987 commented on a diff in pull request #41905: [SPARK-44126][CORE] Shuffle migration failure count should not increase when target executor decommissioned

Posted by "jiangxb1987 (via GitHub)" <gi...@apache.org>.
jiangxb1987 commented on code in PR #41905:
URL: https://github.com/apache/spark/pull/41905#discussion_r1258269305


##########
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:
##########
@@ -143,6 +146,11 @@ private[storage] class BlockManagerDecommissioner(
                     // have been used in the try-block above so there's no point trying again
                     && peer != FallbackStorage.FALLBACK_BLOCK_MANAGER_ID) {
                   fallbackStorage.foreach(_.copy(shuffleBlockInfo, bm))
+                } else if (e.getCause != null && e.getCause.getMessage != null

Review Comment:
   make this a new case statement?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #41905: [SPARK-44126][CORE] Shuffle migration failure count should not increase when target executor decommissioned

Posted by "Ngone51 (via GitHub)" <gi...@apache.org>.
Ngone51 commented on PR #41905:
URL: https://github.com/apache/spark/pull/41905#issuecomment-1734747428

   LGTM
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 closed pull request #41905: [SPARK-44126][CORE] Shuffle migration failure count should not increase when target executor decommissioned

Posted by "Ngone51 (via GitHub)" <gi...@apache.org>.
Ngone51 closed pull request #41905: [SPARK-44126][CORE] Shuffle migration failure count should not increase when target executor decommissioned
URL: https://github.com/apache/spark/pull/41905


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on a diff in pull request #41905: [SPARK-44126][CORE] Shuffle migration failure count should not increase when target executor decommissioned

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on code in PR #41905:
URL: https://github.com/apache/spark/pull/41905#discussion_r1335051555


##########
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:
##########
@@ -143,6 +146,11 @@ private[storage] class BlockManagerDecommissioner(
                     // have been used in the try-block above so there's no point trying again
                     && peer != FallbackStorage.FALLBACK_BLOCK_MANAGER_ID) {
                   fallbackStorage.foreach(_.copy(shuffleBlockInfo, bm))
+                } else if (e.getCause != null && e.getCause.getMessage != null

Review Comment:
   As this condition is complicated and nested, maybe keep what it is is easier to understand. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org