Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2022/08/05 04:14:22 UTC

[GitHub] [dolphinscheduler] caishunfeng opened a new issue, #7024: [Feature][MasterWorker] Self-recovery when master or worker lost connection from registry center

caishunfeng opened a new issue, #7024:
URL: https://github.com/apache/dolphinscheduler/issues/7024

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar feature requirement.
   
   
   ### Background
   
   Currently, if a master or worker loses its zk connection, it does not discover this itself immediately; it only finds out when it checks the dead server list.
   And when it learns that it has been judged a dead server, it stops itself, without any self-recovery.
   
   ### Proposal
   
   ![d264093da134508c29b021d9c57d440](https://user-images.githubusercontent.com/11962619/143667882-118377c2-c7d2-4e5d-9eea-05db9eb52d61.png)
   
   When the master loses its zk connection:
   1. update the current server state to `wait reconnect`
   2. send a server-lost-connection alert
   3. keep quartz working (scheduling keeps working normally through quartz and the db)
   4. stop accepting new requests
   5. stop handling commands and process instances, and clear the local running process instances (they will be taken over by another master)
   6. wait to reconnect
   7. when the reconnect succeeds, send a server-recovered alert, update the server state to `normal`, and resume working
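
The master-side handling listed above can be sketched as a small state machine. This is only an illustrative Python sketch, not DolphinScheduler code: `MasterSketch`, `ServerState`, `on_connection_lost`, `on_reconnected`, and the alert strings are all hypothetical names, and the registry events are assumed to arrive as plain method calls.

```python
from enum import Enum


class ServerState(Enum):
    NORMAL = "normal"
    WAIT_RECONNECT = "wait reconnect"


class MasterSketch:
    """Hypothetical stand-in for a master server (not DolphinScheduler code)."""

    def __init__(self):
        self.state = ServerState.NORMAL
        self.alerts = []                # alerts that would be sent
        self.running_instances = set()  # local running process instances
        self.quartz_running = True      # quartz is left untouched throughout

    def accept(self, instance_id):
        """Accept a new request only while the server state is normal."""
        if self.state is not ServerState.NORMAL:
            return False
        self.running_instances.add(instance_id)
        return True

    def on_connection_lost(self):
        """Assumed callback for a lost registry connection."""
        self.state = ServerState.WAIT_RECONNECT
        self.alerts.append("server lost connection")
        # Quartz keeps running; accept() now refuses new requests.
        # Local instances are dropped so another master can take them over.
        self.running_instances.clear()

    def on_reconnected(self):
        """Assumed callback for a successful reconnect."""
        self.alerts.append("server recovered")
        self.state = ServerState.NORMAL
```

In this sketch the master never stops itself: it stays in `wait reconnect` for as long as needed and simply refuses work until `on_reconnected` is called.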
   
   
   ![628345e6b1dd3cf147f8d40ee35021f](https://user-images.githubusercontent.com/11962619/143667893-da5decc9-7588-42ca-b3d9-503cb029bb27.png)
   
   When the worker loses its zk connection:
   1. update the current server state to `wait reconnect`
   2. send a server-lost-connection alert
   3. kill the running tasks (they will be taken over by the master and rerun)
   4. stop accepting new requests
   5. wait to reconnect within a certain time
   6. if the reconnect times out, stop itself
   7. when the reconnect succeeds, send a server-recovered alert, update the server state to `normal`, and resume working
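
The worker differs from the master mainly in that it kills its running tasks and gives up after a reconnect timeout. A minimal Python sketch of that behaviour follows. All names here (`WorkerSketch`, `WorkerState`, `tick`, the `stopped` state, the 60-second default) are hypothetical and are not taken from DolphinScheduler. The clock is injected so the timeout can be exercised without sleeping.

```python
import time
from enum import Enum


class WorkerState(Enum):
    NORMAL = "normal"
    WAIT_RECONNECT = "wait reconnect"
    STOPPED = "stopped"  # assumed name for the "stop itself" outcome


class WorkerSketch:
    """Hypothetical stand-in for a worker server (not DolphinScheduler code)."""

    def __init__(self, reconnect_timeout_s=60.0, clock=time.monotonic):
        self.state = WorkerState.NORMAL
        self.reconnect_timeout_s = reconnect_timeout_s  # assumed setting
        self.clock = clock
        self.alerts = []
        self.running_tasks = set()
        self.killed_tasks = []
        self._lost_at = None

    def accept(self, task_id):
        """Accept a new task only while the server state is normal."""
        if self.state is not WorkerState.NORMAL:
            return False
        self.running_tasks.add(task_id)
        return True

    def on_connection_lost(self):
        if self.state is not WorkerState.NORMAL:
            return
        self.state = WorkerState.WAIT_RECONNECT
        self._lost_at = self.clock()
        self.alerts.append("server lost connection")
        # Kill local tasks; the master is expected to take over and rerun them.
        self.killed_tasks.extend(sorted(self.running_tasks))
        self.running_tasks.clear()

    def tick(self):
        """Call periodically; stops the worker once the reconnect window ends."""
        if (self.state is WorkerState.WAIT_RECONNECT
                and self.clock() - self._lost_at >= self.reconnect_timeout_s):
            self.state = WorkerState.STOPPED

    def on_reconnected(self):
        """Return True if the worker recovered, False if it was not waiting."""
        if self.state is not WorkerState.WAIT_RECONNECT:
            return False
        self.alerts.append("server recovered")
        self.state = WorkerState.NORMAL
        self._lost_at = None
        return True
```

A reconnect that arrives after the timeout is ignored in this sketch, since the worker has already stopped. Whether a real worker should behave that way is a design choice the proposal leaves open.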
   
   _Originally posted by @caishunfeng in https://github.com/apache/dolphinscheduler/discussions/6643#discussioncomment-1706255_
   
   ### Related issues
   
   #7004 
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] ruanwenjun closed issue #7024: [Improvement][MasterWorker] Self-recovery when master or worker lost connection from registry center

Posted by GitBox <gi...@apache.org>.
ruanwenjun closed issue #7024: [Improvement][MasterWorker] Self-recovery when master or worker lost connection from registry center
URL: https://github.com/apache/dolphinscheduler/issues/7024

