Posted to issues@celeborn.apache.org by "AngersZhuuuu (via GitHub)" <gi...@apache.org> on 2023/04/21 04:07:27 UTC

[GitHub] [incubator-celeborn] AngersZhuuuu opened a new pull request, #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled

AngersZhuuuu opened a new pull request, #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444

   ### What changes were proposed in this pull request?
   Stop blocking RPC dispatcher threads in handleGetReducerFileGroup while waiting for handleStageEnd to complete. Instead, register pending GetReducerFileGroup requests per shuffle and reply to them from setStageEnd once the stage end has been handled (see the sketch under the logs below).
   
   ### Why are the changes needed?
   handleGetReducerFileGroup occupies too many RPC threads while waiting for handleStageEnd, so other RPCs cannot be handled:
   
   ```
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
   ```
   ```
   23/04/21 08:37:14 ERROR Executor: Exception in task 67.2 in stage 5.0 (TID 27521)
   com.aliyun.emr.rss.common.rpc.RpcTimeoutException: Futures timed out after [30 seconds]. This timeout is controlled by rss.rpc.lookupTimeout
   	at com.aliyun.emr.rss.common.rpc.RpcTimeout.com$aliyun$emr$rss$common$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:46)
   	at com.aliyun.emr.rss.common.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:61)
   	at com.aliyun.emr.rss.common.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:57)
   	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
   	at com.aliyun.emr.rss.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
   	at com.aliyun.emr.rss.common.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:95)
   	at com.aliyun.emr.rss.common.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:103)
   	at com.aliyun.emr.rss.client.ShuffleClientImpl.setupMetaServiceRef(ShuffleClientImpl.java:1089)
   	at com.aliyun.emr.rss.client.ShuffleClient.get(ShuffleClient.java:86)
   	at org.apache.spark.shuffle.rss.RssShuffleReader.<init>(RssShuffleReader.scala:43)
   	at org.apache.spark.shuffle.rss.RssShuffleManager.getReader(RssShuffleManager.scala:203)
   	at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:190)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1509)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
   	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
   	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
   	at com.aliyun.emr.rss.common.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
   	at com.aliyun.emr.rss.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:74)
   	... 32 more 
   ```
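   
   With all RPC dispatcher threads stuck in these waits, other RPCs, such as the executor's endpoint lookup in the stack trace above, time out. A minimal sketch of the register-and-reply pattern described above (the names RpcContext, pendingRequests and CommitHandlerSketch are placeholders for illustration, not the actual Celeborn identifiers):
   
   ```scala
   import scala.collection.mutable
   
   trait RpcContext // stands in for the real RPC call context type
   
   class CommitHandlerSketch {
     // shuffleId -> GetReducerFileGroup requests that arrived before stage end
     private val pendingRequests = mutable.HashMap[Int, mutable.Buffer[RpcContext]]()
     private val stageEndShuffles = mutable.HashSet[Int]()
   
     // Runs on an RPC dispatcher thread: never block here waiting for stage end.
     def handleGetReducerFileGroup(context: RpcContext, shuffleId: Int): Unit =
       pendingRequests.synchronized {
         if (stageEndShuffles.contains(shuffleId)) {
           // stage already ended, answer immediately
           replyGetReducerFileGroup(context, shuffleId)
         } else {
           // park the request and free the RPC thread
           pendingRequests.getOrElseUpdate(shuffleId, mutable.Buffer.empty[RpcContext]) += context
         }
       }
   
     // Called once handleStageEnd completes: answer every parked request.
     def setStageEnd(shuffleId: Int): Unit =
       pendingRequests.synchronized {
         stageEndShuffles += shuffleId
         pendingRequests.remove(shuffleId).getOrElse(mutable.Buffer.empty[RpcContext])
           .foreach(replyGetReducerFileGroup(_, shuffleId))
       }
   
     private def replyGetReducerFileGroup(context: RpcContext, shuffleId: Int): Unit = {
       // build and send the GetReducerFileGroupResponse for shuffleId to this context
     }
   }
   ```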
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




[GitHub] [incubator-celeborn] AngersZhuuuu commented on a diff in pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444#discussion_r1173437469


##########
client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala:
##########
@@ -93,7 +95,11 @@ class ReducePartitionCommitHandler(
   }
 
   override def setStageEnd(shuffleId: Int): Unit = {
-    stageEndShuffleSet.add(shuffleId)
+    getReducerFileGroupRequest synchronized {
+      stageEndShuffleSet.add(shuffleId)
+      getReducerFileGroupRequest.remove(shuffleId)
+        .asScala.foreach(replyGetReducerFileGroup(_, shuffleId))

Review Comment:
   Updated





[GitHub] [incubator-celeborn] waitinfuture commented on pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled

Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444#issuecomment-1517376331

   ping @RexXiong




[GitHub] [incubator-celeborn] codecov[bot] commented on pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled

Posted by "codecov[bot] (via GitHub)" <gi...@apache.org>.
codecov[bot] commented on PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444#issuecomment-1517236301

   ## [Codecov](https://codecov.io/gh/apache/incubator-celeborn/pull/1444?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1444](https://codecov.io/gh/apache/incubator-celeborn/pull/1444?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (551a6b6) into [main](https://codecov.io/gh/apache/incubator-celeborn/commit/6830cb61efb09a1bbeb1ee8a6f54e92528339e8d?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (6830cb6) will **increase** coverage by `0.21%`.
   > The diff coverage is `n/a`.
   
   ```diff
   @@            Coverage Diff             @@
   ##             main    #1444      +/-   ##
   ==========================================
   + Coverage   44.76%   44.97%   +0.21%     
   ==========================================
     Files         156      156              
     Lines        9580     9580              
     Branches      956      956              
   ==========================================
   + Hits         4288     4308      +20     
   + Misses       5009     4993      -16     
   + Partials      283      279       -4     
   ```
   
   
   [see 3 files with indirect coverage changes](https://codecov.io/gh/apache/incubator-celeborn/pull/1444/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   




[GitHub] [incubator-celeborn] waitinfuture commented on a diff in pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled

Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on code in PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444#discussion_r1173430611


##########
client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala:
##########
@@ -93,7 +95,11 @@ class ReducePartitionCommitHandler(
   }
 
   override def setStageEnd(shuffleId: Int): Unit = {
-    stageEndShuffleSet.add(shuffleId)
+    getReducerFileGroupRequest synchronized {
+      stageEndShuffleSet.add(shuffleId)
+      getReducerFileGroupRequest.remove(shuffleId)
+        .asScala.foreach(replyGetReducerFileGroup(_, shuffleId))

Review Comment:
   IMO we should move the invocation of replyGetReducerFileGroup out of the synchronized block to keep the critical section from costing too much time.
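   
   Something like this (a sketch reusing the field names from the diff above; the exact types and null handling are assumptions):
   
   ```scala
   override def setStageEnd(shuffleId: Int): Unit = {
     // Do only the bookkeeping inside the critical section.
     val pending = getReducerFileGroupRequest synchronized {
       stageEndShuffleSet.add(shuffleId)
       getReducerFileGroupRequest.remove(shuffleId)
     }
     // Reply after releasing the lock so concurrent callers registering new
     // GetReducerFileGroup requests are not blocked behind the replies.
     if (pending != null) {
       pending.asScala.foreach(replyGetReducerFileGroup(_, shuffleId))
     }
   }
   ```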





[GitHub] [incubator-celeborn] AngersZhuuuu commented on pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444#issuecomment-1517229868

   ping @pan3793 @waitinfuture @FMX @RexXiong 




[GitHub] [incubator-celeborn] AngersZhuuuu merged pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu merged PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org