Posted to issues@celeborn.apache.org by "AngersZhuuuu (via GitHub)" <gi...@apache.org> on 2023/04/21 04:07:27 UTC
[GitHub] [incubator-celeborn] AngersZhuuuu opened a new pull request, #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled
AngersZhuuuu opened a new pull request, #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444
### What changes were proposed in this pull request?
### Why are the changes needed?
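`handleGetReducerFileGroup` occupies too many RPC threads while it waits for `handleStageEnd` to complete, so other RPCs cannot be handled. The LifecycleManager log fills with waiting handlers, and executors time out while setting up the endpoint reference: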
```
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
23/04/21 08:37:52 INFO LifecycleManager: [handleGetReducerFileGroup] Waiting for handleStageEnd complete...
```
```
23/04/21 08:37:14 ERROR Executor: Exception in task 67.2 in stage 5.0 (TID 27521)
com.aliyun.emr.rss.common.rpc.RpcTimeoutException: Futures timed out after [30 seconds]. This timeout is controlled by rss.rpc.lookupTimeout
at com.aliyun.emr.rss.common.rpc.RpcTimeout.com$aliyun$emr$rss$common$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:46)
at com.aliyun.emr.rss.common.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:61)
at com.aliyun.emr.rss.common.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:57)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at com.aliyun.emr.rss.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at com.aliyun.emr.rss.common.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:95)
at com.aliyun.emr.rss.common.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:103)
at com.aliyun.emr.rss.client.ShuffleClientImpl.setupMetaServiceRef(ShuffleClientImpl.java:1089)
at com.aliyun.emr.rss.client.ShuffleClient.get(ShuffleClient.java:86)
at org.apache.spark.shuffle.rss.RssShuffleReader.<init>(RssShuffleReader.scala:43)
at org.apache.spark.shuffle.rss.RssShuffleManager.getReader(RssShuffleManager.scala:203)
at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:190)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1509)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at com.aliyun.emr.rss.common.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at com.aliyun.emr.rss.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:74)
... 32 more
```
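The failure mode shown in the two logs above can be reproduced in miniature. The following is a simplified model only, assuming a fixed-size handler pool; it is not Celeborn's RPC dispatcher, and `RpcStarvationSketch`, `lookupStarves`, and the pool size are hypothetical names and values chosen for illustration. Every pool thread blocks waiting for a "stage end" signal, so an unrelated lookup-style request queued behind them times out.

```scala
import java.util.concurrent.{Callable, CountDownLatch, Executors, TimeUnit, TimeoutException}

// Hypothetical, simplified model of handler-thread starvation (not Celeborn code).
object RpcStarvationSketch {

  /** Returns true if an unrelated request times out because every pool thread is blocked. */
  def lookupStarves(poolSize: Int, timeoutMs: Long): Boolean = {
    val pool = Executors.newFixedThreadPool(poolSize)
    val stageEnd = new CountDownLatch(1) // stands in for "handleStageEnd complete"
    try {
      // Each blocking handler parks one pool thread until the stage ends.
      (1 to poolSize).foreach { _ =>
        pool.submit(new Runnable { def run(): Unit = stageEnd.await() })
      }
      // An unrelated request is queued behind the blocked handlers.
      val lookup = pool.submit(new Callable[String] { def call(): String = "endpoint-ref" })
      try {
        lookup.get(timeoutMs, TimeUnit.MILLISECONDS)
        false
      } catch {
        case _: TimeoutException => true // no free thread was available in time
      }
    } finally {
      stageEnd.countDown() // release the blocked handlers
      pool.shutdown()
    }
  }
}
```

In this model the lookup only succeeds once a handler thread is released, which is why parking the request and replying later, instead of blocking inside the handler, keeps the pool available for other RPCs.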
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-celeborn] AngersZhuuuu commented on a diff in pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled
Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444#discussion_r1173437469
##########
client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala:
##########
@@ -93,7 +95,11 @@ class ReducePartitionCommitHandler(
}
override def setStageEnd(shuffleId: Int): Unit = {
- stageEndShuffleSet.add(shuffleId)
+ getReducerFileGroupRequest synchronized {
+ stageEndShuffleSet.add(shuffleId)
+ getReducerFileGroupRequest.remove(shuffleId)
+ .asScala.foreach(replyGetReducerFileGroup(_, shuffleId))
Review Comment:
Updated
[GitHub] [incubator-celeborn] waitinfuture commented on pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled
Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444#issuecomment-1517376331
ping @RexXiong
[GitHub] [incubator-celeborn] codecov[bot] commented on pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled
Posted by "codecov[bot] (via GitHub)" <gi...@apache.org>.
codecov[bot] commented on PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444#issuecomment-1517236301
## [Codecov](https://codecov.io/gh/apache/incubator-celeborn/pull/1444?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#1444](https://codecov.io/gh/apache/incubator-celeborn/pull/1444?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (551a6b6) into [main](https://codecov.io/gh/apache/incubator-celeborn/commit/6830cb61efb09a1bbeb1ee8a6f54e92528339e8d?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (6830cb6) will **increase** coverage by `0.21%`.
> The diff coverage is `n/a`.
```diff
@@ Coverage Diff @@
## main #1444 +/- ##
==========================================
+ Coverage 44.76% 44.97% +0.21%
==========================================
Files 156 156
Lines 9580 9580
Branches 956 956
==========================================
+ Hits 4288 4308 +20
+ Misses 5009 4993 -16
+ Partials 283 279 -4
```
[see 3 files with indirect coverage changes](https://codecov.io/gh/apache/incubator-celeborn/pull/1444/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
[GitHub] [incubator-celeborn] waitinfuture commented on a diff in pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled
Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on code in PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444#discussion_r1173430611
##########
client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala:
##########
@@ -93,7 +95,11 @@ class ReducePartitionCommitHandler(
}
override def setStageEnd(shuffleId: Int): Unit = {
- stageEndShuffleSet.add(shuffleId)
+ getReducerFileGroupRequest synchronized {
+ stageEndShuffleSet.add(shuffleId)
+ getReducerFileGroupRequest.remove(shuffleId)
+ .asScala.foreach(replyGetReducerFileGroup(_, shuffleId))
Review Comment:
IMO we should move the invocation of replyGetReducerFileGroup out of the synchronized block, so that the critical section does not take too long.
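A minimal sketch of what this suggestion could look like, assuming parked requests are kept per shuffle id. `PendingReplies`, `request`, `lock`, and the `reply` callback are hypothetical names for illustration and are not taken from the merged code. The lock only guards the state change and the draining of parked requests; the replies themselves are sent after the lock is released.

```scala
import scala.collection.mutable

// Hypothetical sketch (not the merged Celeborn code): park requests until the
// stage ends, and keep reply calls outside the critical section.
final class PendingReplies[Ctx](reply: (Ctx, Int) => Unit) {
  private val lock = new Object
  private val stageEnded = mutable.Set.empty[Int]
  private val pending = mutable.Map.empty[Int, mutable.ArrayBuffer[Ctx]]

  /** Called by a request handler: reply at once if the stage has ended, otherwise park. */
  def request(shuffleId: Int, ctx: Ctx): Unit = {
    val replyNow = lock.synchronized {
      if (stageEnded.contains(shuffleId)) true
      else {
        pending.getOrElseUpdate(shuffleId, mutable.ArrayBuffer.empty[Ctx]) += ctx
        false
      }
    }
    if (replyNow) reply(ctx, shuffleId) // outside the lock
  }

  /** Marks the stage as ended and answers every parked request for that shuffle. */
  def setStageEnd(shuffleId: Int): Unit = {
    // Inside the lock: only mutate state and take a snapshot of the parked requests.
    val toReply = lock.synchronized {
      stageEnded += shuffleId
      pending.remove(shuffleId).map(_.toList).getOrElse(Nil)
    }
    // Outside the lock: potentially slow reply calls no longer block other handlers.
    toReply.foreach(ctx => reply(ctx, shuffleId))
  }
}
```

Because the stage-end flag is set and the parked list is drained under the same lock, a request that arrives concurrently is either parked before the drain or sees the flag and is answered immediately, so no request should be left waiting.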
[GitHub] [incubator-celeborn] AngersZhuuuu commented on pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled
Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444#issuecomment-1517229868
ping @pan3793 @waitinfuture @FMX @RexXiong
[GitHub] [incubator-celeborn] AngersZhuuuu merged pull request #1444: [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled
Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu merged PR #1444:
URL: https://github.com/apache/incubator-celeborn/pull/1444