You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@uniffle.apache.org by GitBox <gi...@apache.org> on 2022/11/18 09:08:12 UTC

[GitHub] [incubator-uniffle] xianjingfeng opened a new issue, #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

xianjingfeng opened a new issue, #339:
URL: https://github.com/apache/incubator-uniffle/issues/339

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### What would you like to be improved?
   
   Now in `org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient#sendShuffleData`, it will retry to send to one shuffle server for a long time and fail after reach `rss.client.send.check.timeout.ms`. Exception as follows:
   
   `Timeout: Task[2852_0] failed because 200 blocks can't be sent to shuffle server in 600000 ms.`
   
   This will cause that client will not send data to other servers.
   
   ### How should we improve?
   
   1. Don't retry in `requirePreAllocation` and just retry in upper level
   2. Set the default value of `rss.client.send.check.timeout.ms` to a smaller value, such as 10.
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328002063

   > If we use multi replicas and then we encounter rpc exception, it seems that we shouldn't retry.
   
   How about if there is a problem with the client network.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1327996397

   > > Could we add a config option for multi replicas timeout? If we use multi replicas, we could use this timeout.
   > 
   > It's a little complicated. How about set real timeout to `rss.client.send.check.timeout.ms * 2` if `rss.data.replica.skip.enabled=true`
   > 
   > Another problems:
   > 
   > 1. Before [[ISSUE-76] Disallow sendShuffleData if requireBufferId expired #159](https://github.com/apache/incubator-uniffle/pull/159) it will retry 100 times, but it will retry 300 times after it. Should we set `rss.client.retry.max=33` for recovery it?
   > 
   > > `rss.client.send.check.timeout.ms=600000` `rss.client.retry.max=100` `rss.client.retry.interval.max=10000`
   > 
   > 100 * 10000 = 1000000 > 600000, This default values seems unreasonable.
   
   Could we add some retry strategy? When we encounter some exception, we won't retry.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328011374

   Does it waste too much time that we still retry `requirePreAllocation` because of no buffer?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] zuston commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

zuston commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1319735279

   Maybe this has been optimized in #332 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328012049

   > Does it waste too much time that we still retry `requirePreAllocation` because of no buffer?
   
   Yes. Because it is very common that there is no buffer, so we should retry for a few times. Maybe 33 is too many, i think 10 is enough


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1321368429

   > Make sense.
   > 
   > But this may be hard to control. If you have some ideas, please go ahead
   
   I just change the default value and adjust the retry logic introduced by #159 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1327573398

   Do you want to solve to avoid sending too much data to a heavy load server?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1327964928

   > How can we judge whether the server is writable?
   
   We can set a timeout to per server. The current problem is that there is only the timeout for block sending and the time to send data to one server is bigger than the timeout for block sending


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1319821373

   It seems not. This issue describes that when there are multiple replicas, if the first server can't be written, we should switch to servers as soon as possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328004001

   > > If we use multi replicas and then we encounter rpc exception, it seems that we shouldn't retry.
   > 
   > How about if there is a problem with the client network.
   
   Em.... it seems better that we reduce the retry times. Evenly, retry times could be less than 33. For online service, 1 retryTimes is enough. For our production, service usually will happen full gc. It may last 20s. For multi replicas, you could give a better advice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1327619926

   > Do you want to solve to avoid sending too much data to a heavy load server?
   
   No, i just want to solve that when the first shuffle server can't be written, we can send data to another one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1327653971

   How can we judge whether the server is writable?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] advancedxy commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

advancedxy commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328007047

   On the other side, besides to reduce retry times, the client could send the same data to different shuffle server concurrently? Other shuffle system does that.
   
   However it may require large effort of refactoring and the network bandwidth and shuffle data usage doubled. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328017090

   > Thanks for pointing out that. I did a quick look at the shuffle write client code, it seems that the `sendShuffleData` process is blocked by the first round to the primary shuffle servers. I think rather to reduce the `rss.client.send.check.timeout.ms` settings(-> which causes task retry), how about refactor `sendShuffleDataAsync` to return early and the send round of `sendShuffleDataAsync` could start asap.
   > 
   > The purpose of sending data concurrently is that once a request finished, the whole sending process would finish, which avoids busy/flaky servers.
   
    `rss.client.send.check.timeout.ms` is for limit the max time to finish send blocks to all replicas. And how to let it return early is what we need discuss.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328014056

   Maybe It's better that  use the timeout `rss.client.send.check.timeout.ms * replicateWrite`. RetryTimes use `rss.client.retry.max / replicateWrite`. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] advancedxy commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

advancedxy commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328014871

   > > the client could send the same data to different shuffle server concurrently?
   > 
   > It has done this. you can see
   > 
   > https://github.com/apache/incubator-uniffle/blob/57a834b083d9dae291091e4f2359d060ba30abe0/client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java#L240-L258
   
   Thanks for pointing out that. I did a quick look at the shuffle write client code, it seems that the `sendShuffleData` process is blocked by the first round to the primary shuffle servers. I think rather to reduce the `rss.client.send.check.timeout.ms` settings(-> which causes task retry), how about refactor `sendShuffleDataAsync` to return early and the send round of `sendShuffleDataAsync` could start asap.
   
   The purpose of sending data concurrently is that once a request finished, the whole sending process would finish, which avoids busy/flaky servers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328008770

   > Em.... it seems better that we reduce the retry times. Evenly, retry times could be less than 33. For online service, 1 retryTimes is enough. For our production, service usually will happen full gc. It may last 20s. For multi replicas, you could give a better advice.
   
   `rss.client.retry.max` is just for `requirePreAllocation` and it just retry when server has no buffer.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1327995864

   > Could we add a config option for multi replicas timeout? If we use multi replicas, we could use this timeout.
   
   It's a little complicated. How about set real timeout to  `rss.client.send.check.timeout.ms *  2` if `rss.data.replica.skip.enabled=true`
   
   Another problems:
   1. Before #159 it will retry 100 times, but it will retry 300 times after it. Should we set `rss.client.retry.max=33` for recovery it?
   2.  
   > `rss.client.send.check.timeout.ms=600000` `rss.client.retry.max=100` `rss.client.retry.interval.max=10000`
   
   100 * 10000 = 1000000 > 600000,  This default values seems unreasonable.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1322003687

   > > I don't suggest that we modify the default value. Because we should verify the parameters in our production environment with huge amount of data first.
   > 
   > But the current default value is obviously unreasonable. `rss.client.send.check.timeout.ms=600000` `rss.client.retry.max=100` `rss.client.retry.interval.max=10000` It will not send data to other servers if the first server blocked. And after #159, it will retry 300 times
   
   Is it ok for single replica?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328020945

   > 
   > > > `rss.client.send.check.timeout.ms=600000` `rss.client.retry.max=100` `rss.client.retry.interval.max=10000`
   > > 
   > > 
   > > 100 * 10000 = 1000000 > 600000, This default values seems unreasonable.
   > 
   > But we should also solve this problem first.
   
   Should we reduce `rss.client.retry.max` and `rss.client.retry.interval.max=10000` according to write replicas?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328022694

   > > > > `rss.client.send.check.timeout.ms=600000` `rss.client.retry.max=100` `rss.client.retry.interval.max=10000`
   > > > 
   > > > 
   > > > 100 * 10000 = 1000000 > 600000, This default values seems unreasonable.
   > > 
   > > 
   > > But we should also solve this problem first.
   > 
   > Should we reduce `rss.client.retry.max` and `rss.client.retry.interval.max=10000` according to write replicas?
   
   I think it is ok.  And i think we should remove `rss.client.send.check.timeout.ms`, we can just control by `rss.client.retry.max` and `rss.client.retry.interval.max`. Otherwise, it's too complex and diffcult to control.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328068030

   > > > > > `rss.client.send.check.timeout.ms=600000` `rss.client.retry.max=100` `rss.client.retry.interval.max=10000`
   > > > > 
   > > > > 
   > > > > 100 * 10000 = 1000000 > 600000, This default values seems unreasonable.
   > > > 
   > > > 
   > > > But we should also solve this problem first.
   > > 
   > > 
   > > Should we reduce `rss.client.retry.max` and `rss.client.retry.interval.max=10000` according to write replicas?
   > 
   > I think it is ok. And i think we should remove `rss.client.send.check.timeout.ms`, we can just control by `rss.client.retry.max` and `rss.client.retry.interval.max`. Otherwise, it's too complex and diffcult to control.
   
   Could we check  `rss.client.send.check.timeout.ms`? If   `rss.client.send.check.timeout.ms` is less than `rss.client.retry.max * rss.client.retry.interval.max`, we throw an exception.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] zuston commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

zuston commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1320037464

   Make sense. 
   
   But this may be hard to control. If you have some ideas, please go ahead


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1320063081

   I don't suggest that we modify the default value. Because we should verify the parameters in our production environment with huge amount of data first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1321367750

   > I don't suggest that we modify the default value. Because we should verify the parameters in our production environment with huge amount of data first.
   
   But the current default value is obviously unreasonable.
   `rss.client.send.check.timeout.ms=60000`  `rss.client.retry.max=100`  `rss.client.retry.interval.max=10000`
   It will not send data to other servers if the first server blocked.
   And after #159, it will retry 300 times


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1323765728

   > Is it ok for single replica?
   
   Yes. But I think we usually use multiple replicas.If the user is not familiar with this mechanism, he will think that fault tolerance can be achieved without modifying other configurations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1327292966

   Do you have any better ideas? @jerqi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1327971057

   > > How can we judge whether the server is writable?
   > 
   > We can set a timeout to per server. The current problem is that there is only the timeout for block sending and the time to send data to one server is bigger than the timeout for block sending
   
   Could we add a config option for multi replicas timeout? If we use multi replicas, we could use this timeout.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328159774

   > Could we check `rss.client.send.check.timeout.ms`? If `rss.client.send.check.timeout.ms` is less than `rss.client.retry.max * rss.client.retry.interval.max`, we throw an exception.
   
   I’m ok


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328000060

   <img width="596" alt="企业微信截图_42660f61-94f9-4f06-87ce-1e87ecb0859b" src="https://user-images.githubusercontent.com/8159038/204078265-dbe47223-df60-4a8e-827e-7b122e89e07f.png">
   If we use multi replicas and then we encounter rpc exception, it seems that we shouldn't retry.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1327998854

   > Could we add some retry strategy? When we encounter some exception, we won't retry.
   
   It will not retry if some exception encountered in `requirePreAllocation`, it just retry when server has no buffer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328008527

   > the client could send the same data to different shuffle server concurrently?
   
   It has done this. you can see
   https://github.com/apache/incubator-uniffle/blob/57a834b083d9dae291091e4f2359d060ba30abe0/client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java#L240-L258


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

xianjingfeng commented on issue #339:
URL: https://github.com/apache/incubator-uniffle/issues/339#issuecomment-1328015950

   > rss.client.send.check.timeout.ms * replicateWrite
   
   `rss.client.send.check.timeout.ms * 2`,  because it will send two rounds no matter how many replica.
   And i am ok for this logic. 
   
   
   > > `rss.client.send.check.timeout.ms=600000` `rss.client.retry.max=100` `rss.client.retry.interval.max=10000`
   > 
   > 100 * 10000 = 1000000 > 600000, This default values seems unreasonable.
   
   But we should also solve this problem first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-uniffle] jerqi closed issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`

Posted by GitBox <gi...@apache.org>.

jerqi closed issue #339: [Improvement] Optimize retry logic in `ShuffleServerGrpcClient#sendShuffleData`
URL: https://github.com/apache/incubator-uniffle/issues/339


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org