Posted to reviews@spark.apache.org by "cxzl25 (via GitHub)" <gi...@apache.org> on 2023/07/03 03:14:38 UTC

[GitHub] [spark] cxzl25 commented on pull request #40972: [SPARK-43301][CORE][SHUFFLE] BlockStoreClient getHostLocalDirs RPC supports IOException retry

cxzl25 commented on PR #40972:
URL: https://github.com/apache/spark/pull/40972#issuecomment-1617158871

   Thanks for the comments, everyone, and sorry for the delayed reply.
   
   @jiangxb1987 @mridulm 
   
   In our production environment, `IOException`s such as connection timeouts and connection resets can occur even when both endpoints are on the same host, and they result in `FetchFailedException`.
   
   The retry logic here follows `RetryingBlockTransferor#shouldRetry`:
   https://github.com/apache/spark/blob/8483191dfbb2842c10d5e8d2409eafed6dfc4abd/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L215-L231
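
For readers who do not want to follow the link, here is a minimal sketch of that kind of retry decision. It is not the code from this PR or from `RetryingBlockTransferor`. The class name `HostLocalDirsRetrySketch`, the helper `callWithRetry`, and the parameters `maxRetries` and `retryWaitMs` are hypothetical names chosen for illustration, and the real method applies further conditions that are left out here. The idea is only this: retry when the failure is an `IOException` (directly or as the immediate cause) and the retry budget is not exhausted.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Illustrative sketch only; all names here are assumptions, not Spark APIs.
final class HostLocalDirsRetrySketch {
  private HostLocalDirsRetrySketch() {}

  /**
   * Returns true when the failure is an IOException, either directly or as
   * the immediate cause, and fewer than maxRetries retries have been used.
   */
  static boolean shouldRetry(Throwable e, int retryCount, int maxRetries) {
    boolean isIOException =
        e instanceof IOException || e.getCause() instanceof IOException;
    boolean hasRemainingRetries = retryCount < maxRetries;
    return isIOException && hasRemainingRetries;
  }

  /**
   * Runs op, retrying up to maxRetries times on retryable failures and
   * sleeping retryWaitMs between attempts. Any other failure, or a failure
   * after the budget is used up, is rethrown to the caller unchanged.
   */
  static <T> T callWithRetry(Callable<T> op, int maxRetries, long retryWaitMs)
      throws Exception {
    int retryCount = 0;
    while (true) {
      try {
        return op.call();
      } catch (Exception e) {
        if (!shouldRetry(e, retryCount, maxRetries)) {
          throw e;
        }
        retryCount++;
        Thread.sleep(retryWaitMs);
      }
    }
  }
}
```

With a shape like this, a transient connection timeout or reset during the local-dirs lookup would be retried a bounded number of times before it surfaces as a fetch failure, while non-I/O errors still fail immediately.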
   
   This PR aims to improve the stability of `fetchHostLocalBlocks` in the shuffle read path by retrying the `getHostLocalDirs` RPC when it fails with an `IOException`.
   
   Here is a log excerpt from our production environment:
   ```java
   org.apache.spark.shuffle.FetchFailedException
   at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1170)
   Caused by: java.io.IOException: Connecting to XXX timed out (600000 ms)
   	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:285)
   	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
   	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
   	at org.apache.spark.network.shuffle.BlockStoreClient.getHostLocalDirs(BlockStoreClient.java:164)
   	at org.apache.spark.storage.HostLocalDirManager.getHostLocalDirs(BlockManager.scala:151)
   	at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchHostLocalBlocks$4(ShuffleBlockFetcherIterator.scala:639)
   	at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchHostLocalBlocks$4$adapted(ShuffleBlockFetcherIterator.scala:638)
   	at scala.collection.immutable.List.foreach(List.scala:431)
   	at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchHostLocalBlocks(ShuffleBlockFetcherIterator.scala:638)
   	at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchAllHostLocalBlocks$1(ShuffleBlockFetcherIterator.scala:720)
   	at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchAllHostLocalBlocks$1$adapted(ShuffleBlockFetcherIterator.scala:720)
   	at scala.Option.foreach(Option.scala:407)
   	at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchAllHostLocalBlocks(ShuffleBlockFetcherIterator.scala:720)
   	at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:712)
   	at org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:191)
   	at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89)
   	at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:240)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org