Posted to dev@uniffle.apache.org by GitBox <gi...@apache.org> on 2022/09/21 10:11:28 UTC

[GitHub] [incubator-uniffle] zuston opened a new issue, #234: Introduce rejection mechanism when coordinator server is starting

zuston opened a new issue, #234:
URL: https://github.com/apache/incubator-uniffle/issues/234

   ### Background
   When a coordinator's conf is changed and the coordinator is restarted, it accepts client `getAssignment` requests immediately, but it serves those requests based only on the shuffle-servers that have registered so far. As a result, some jobs get fewer shuffle-servers than they require, which slows them down.
   
   I think the coordinator should wait for more than one shuffle-server heartbeat interval before serving clients. While it is out of service, client requests will fall back to the slave coordinator.
   
   Besides, I think this rejection mechanism could be enabled through the coordinator conf.
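
A minimal sketch of the gate proposed above, to make the idea concrete. Everything here is an assumption for illustration: the `StartupGate` class, its `enabled` switch, and the 15-second window are hypothetical and are not taken from the Uniffle code base or from the eventual fix. The clock is injected so the behavior can be checked without sleeping.

```java
import java.util.function.LongSupplier;

/** Hypothetical startup gate; not the actual Uniffle implementation. */
final class StartupGate {
  private final boolean enabled;     // hypothetical conf switch
  private final long graceMillis;    // how long to reject requests after start
  private final long startMillis;    // time at which the coordinator started
  private final LongSupplier clock;  // injected so the gate can be tested

  StartupGate(boolean enabled, long graceMillis, LongSupplier clock) {
    this.enabled = enabled;
    this.graceMillis = graceMillis;
    this.clock = clock;
    this.startMillis = clock.getAsLong();
  }

  /** True while the coordinator should refuse getAssignment requests. */
  boolean shouldReject() {
    return enabled && clock.getAsLong() - startMillis < graceMillis;
  }

  public static void main(String[] args) {
    long[] now = {0L};
    // Assumed window: somewhat longer than one shuffle-server heartbeat interval.
    StartupGate gate = new StartupGate(true, 15_000L, () -> now[0]);
    System.out.println(gate.shouldReject()); // true: still inside the window
    now[0] = 15_000L;
    System.out.println(gate.shouldReject()); // false: the window has elapsed
  }
}
```

With `enabled` set to `false` the gate never rejects, which corresponds to making the mechanism optional in the coordinator conf.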


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-uniffle] jerqi commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1253585138

   How do we judge whether we have enough shuffle servers?




[GitHub] [incubator-uniffle] zuston commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1253655310

   > How do we judge whether we have enough shuffle servers?
   
   Wait until one shuffle server heartbeat interval has elapsed (default 10s).




[GitHub] [incubator-uniffle] zuston commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1258812440

   Any ideas on this? @jerqi 




[GitHub] [incubator-uniffle] zuston commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1274427139

   Solved in https://github.com/apache/incubator-uniffle/pull/247. Close this issue.




[GitHub] [incubator-uniffle] jerqi commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1254666149

   > Is each coordinator process currently maintained in a single Pod or in a shared StatefulSet? If it uses a single Pod, it's not a problem. If it uses a shared StatefulSet, maybe the operator should kill the Pods managed by the StatefulSet one by one.
   > 
   > @wangao1236 Could you help with some background knowledge?
   
   The coordinators use two deployments.




[GitHub] [incubator-uniffle] jerqi commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1254684732

   > > I think this design is not friendly to automated deployment. That is my concern.
   > 
   > The operator could control when the coordinator deployments start, so this is not a problem, right?
   
   It's OK for our k8s operator. But if other users use a different automated deployment mechanism, it may cause problems.




[GitHub] [incubator-uniffle] zuston closed issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
zuston closed issue #234: Introduce rejection mechanism when coordinator server is starting
URL: https://github.com/apache/incubator-uniffle/issues/234




[GitHub] [incubator-uniffle] zuston commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1254557699

   Is each coordinator process currently maintained in a single Pod or in a shared StatefulSet? If it uses a single Pod, it's not a problem. If it uses a shared StatefulSet, maybe the operator should kill the Pods managed by the StatefulSet one by one.
   
   @wangao1236 Could you help with some background knowledge?
   




[GitHub] [incubator-uniffle] jerqi commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1254432325

   > > How do we judge whether we have enough shuffle servers?
   > 
   > Wait until one shuffle server heartbeat interval has elapsed (default 10s).
   
   Maybe one heartbeat interval is not enough, and in special cases we can't wait for any servers. How does the YARN ResourceManager handle this problem? I suggest that we pend the requests instead of rejecting them when the coordinator starts.




[GitHub] [incubator-uniffle] zuston commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1254689680

   > > > I think this design is not friendly to automated deployment. That is my concern.
   > > 
   > > 
   > > The operator could control when the coordinator deployments start, so this is not a problem, right?
   > 
   > It's OK for our k8s operator. But if other users use a different automated deployment mechanism, it may cause problems.
   
   Got it. Maybe we could disable this mechanism by default and add some docs that describe it in more detail.




[GitHub] [incubator-uniffle] jerqi commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1258870882

   > Any ideas on this? @jerqi
   
   It's OK with me if we disable this mechanism by default. Is it a safe mode for the coordinator?




[GitHub] [incubator-uniffle] jerqi commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1254479362

   > I see your point.
   > 
   > > How does the YARN ResourceManager handle this problem?
   > 
   > With HA ResourceManagers there is no such problem, because ZooKeeper fails over to the standby RM. Let's instead consider a single ResourceManager or the Hadoop NameNode. As far as I know, the NameNode enters safe mode on startup and does not exit it until enough block reports from DataNodes have been received. Refer to: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
   > 
   > > I suggest that we pend the requests instead of rejecting them when the coordinator starts.
   > 
   > Pending will slow down the apps. I think we should make the request fall back to another coordinator. Maybe waiting for one heartbeat interval at startup is a good tradeoff; it would be the indicator for when the coordinator exits safe mode.
   
   It means that we shouldn't restart the two coordinators within a short time of each other. It's a little difficult for the K8s controller to select a proper interval between the restarts.




[GitHub] [incubator-uniffle] zuston commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1253493665

   PTAL @jerqi




[GitHub] [incubator-uniffle] zuston commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1258880403

   Naming is difficult. Safe mode or recovery mode?




[GitHub] [incubator-uniffle] zuston commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1254448844

   I see your point.
   
   > How does the YARN ResourceManager handle this problem?
   
   With HA ResourceManagers there is no such problem, because ZooKeeper fails over to the standby RM. Let's instead consider a single ResourceManager or the Hadoop NameNode. As far as I know, on startup the NameNode stays in safe mode until enough block reports from DataNodes have been received. Refer to: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
   
   > I suggest that we pend the requests instead of rejecting them when the coordinator starts.
   
   Pending will slow down the apps. I think we should make the request fall back to another coordinator. Maybe waiting for one heartbeat interval at startup is a good tradeoff.
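
The fallback described above can be sketched as follows. This is only an illustration under stated assumptions: `CoordinatorFallback` and `firstAccepted` are hypothetical names, a rejected request is modeled as an empty `Optional`, and the real Uniffle client may handle rejection and retries differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical client-side helper; not the actual Uniffle client. */
final class CoordinatorFallback {
  /**
   * Asks each coordinator in order and returns the first accepted answer.
   * An empty Optional from the call models a coordinator that rejects the
   * request, for example because it has only just started.
   */
  static <C, R> Optional<R> firstAccepted(
      List<C> coordinators, Function<C, Optional<R>> call) {
    for (C coordinator : coordinators) {
      Optional<R> result = call.apply(coordinator);
      if (result.isPresent()) {
        return result; // stop at the first coordinator that serves the request
      }
    }
    return Optional.empty(); // every coordinator rejected the request
  }

  public static void main(String[] args) {
    List<String> coordinators = Arrays.asList("restarting", "standby");
    Optional<String> assignment = firstAccepted(
        coordinators,
        c -> c.equals("restarting")
            ? Optional.<String>empty()
            : Optional.of("assigned-by-" + c));
    System.out.println(assignment.get()); // assigned-by-standby
  }
}
```

The sketch also shows the concern raised elsewhere in the thread: if every coordinator is inside its startup window at the same time, the result is empty and the client has nowhere to fall back to.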




[GitHub] [incubator-uniffle] zuston commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1254680732

   > I think this design is not friendly to automated deployment. That is my concern.
   
   The operator could control when the coordinator deployments start, so this is not a problem, right?




[GitHub] [incubator-uniffle] jerqi commented on issue #234: Introduce rejection mechanism when coordinator server is starting

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #234:
URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1254667601

   I think this design is not friendly to automated deployment. That is my concern.

