Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2021/05/17 09:33:38 UTC

[GitHub] [dolphinscheduler] ruanwenjun commented on pull request #5211: [Improvement][Server] Must restart master if Zk reconnect (#5210)

ruanwenjun commented on pull request #5211:
URL: https://github.com/apache/dolphinscheduler/pull/5211#issuecomment-842175284


   @CalvinKirs The reason for this issue is that when the `master` or `worker` reconnects to `zookeeper`, `zkClient` produces a `RECONNECTED` event, and the registry responds to this event.
   https://github.com/apache/dolphinscheduler/blob/68301db6b914ff4002bfbc531c6810864d8e47c2/dolphinscheduler-server/src/main/java/org/apache/dolphinscheduler/server/master/registry/MasterRegistry.java#L83-L98
   The registry executes `zookeeperRegistryCenter.getRegisterOperator().persistEphemeral(localNodePath, "");` to create an ephemeral node, which is reasonable.
   But the `persistEphemeral` method does two things to create an ephemeral node:
   1. delete the ephemeral node (if it exists)
   2. create a new ephemeral node
   https://github.com/apache/dolphinscheduler/blob/68301db6b914ff4002bfbc531c6810864d8e47c2/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/zk/ZookeeperOperator.java#L192-L201
   In this case, if the `master` or `worker` has reconnected to zookeeper, the ephemeral node may still exist, because the session may not have expired.
   So the `persistEphemeral` method will delete the existing ephemeral node and create a new one. The problem is that deleting the node produces a `NODE_REMOVED` event; the `ZKMasterClient` responds to this event and marks the master as a dead server. The `HeartBeatTask` will then shut the master down.
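
   The sequence above can be reproduced with a small in-memory simulation. This is only a sketch: `FakeZk`, its event strings, and the node path are hypothetical stand-ins, not DolphinScheduler or Curator code. It just shows that a delete-then-create on an existing node makes a listener see a removal first.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for a ZooKeeper client (illustration only).
class FakeZk {
    private final Set<String> nodes = new HashSet<>();
    private final List<Consumer<String>> listeners = new ArrayList<>();

    void addListener(Consumer<String> listener) { listeners.add(listener); }

    boolean exists(String path) { return nodes.contains(path); }

    void create(String path) {
        nodes.add(path);
        fire("NODE_ADDED " + path);
    }

    void delete(String path) {
        // Only notify listeners if a node was actually removed.
        if (nodes.remove(path)) {
            fire("NODE_REMOVED " + path);
        }
    }

    private void fire(String event) {
        for (Consumer<String> listener : listeners) {
            listener.accept(event);
        }
    }
}

class ReconnectProblemDemo {
    // Mirrors the two steps described above: delete if present, then create.
    static void persistEphemeral(FakeZk zk, String path) {
        if (zk.exists(path)) {
            zk.delete(path);
        }
        zk.create(path);
    }

    // Returns the events a listener observes during a simulated reconnect.
    static List<String> simulateReconnect() {
        FakeZk zk = new FakeZk();
        List<String> events = new ArrayList<>();
        String path = "/nodes/master/192.168.1.10:5678"; // hypothetical path

        zk.create(path);              // initial registration
        zk.addListener(events::add);  // plays the role of the master's node listener
        persistEphemeral(zk, path);   // re-registration; the session has not expired
        return events;
    }

    public static void main(String[] args) {
        System.out.println(simulateReconnect());
    }
}
```

   In this simulation the listener sees a `NODE_REMOVED` event for the master's own node before the new node is added, which corresponds to the event that leads to the master being treated as dead.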
   ---
   I think a good way to solve this issue is that when we create an ephemeral node after a reconnect and the node already exists, we shouldn't remove it; the existing node can simply be kept.
   We just need to modify this method so that line 196 simply returns.
   https://github.com/apache/dolphinscheduler/blob/68301db6b914ff4002bfbc531c6810864d8e47c2/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/zk/ZookeeperOperator.java#L192-L201
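
   A minimal sketch of the proposed behaviour, under stated assumptions: the class below is a hypothetical in-memory model (a `Map` instead of a real ZooKeeper client), and the `boolean` return value and `expireSession` helper are added only for illustration. It is not the actual `ZookeeperOperator` code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the "return early if the node exists" idea.
class KeepExistingNodeSketch {
    private final Map<String, String> nodes = new HashMap<>();

    // Counts how many removal events a listener would have seen.
    int removedEvents = 0;

    /**
     * Creates the ephemeral node only if it is missing.
     * Returns true if a node was created, false if the existing one was kept.
     */
    boolean persistEphemeral(String path, String value) {
        if (nodes.containsKey(path)) {
            // Proposed change: return here instead of deleting the node,
            // so that no removal event is produced for a live node.
            return false;
        }
        nodes.put(path, value);
        return true;
    }

    // Simulates the server dropping the node when the session expires.
    void expireSession(String path) {
        if (nodes.remove(path) != null) {
            removedEvents++;
        }
    }

    public static void main(String[] args) {
        KeepExistingNodeSketch zk = new KeepExistingNodeSketch();
        String path = "/nodes/master/192.168.1.10:5678"; // hypothetical path
        System.out.println(zk.persistEphemeral(path, "")); // first registration
        System.out.println(zk.persistEphemeral(path, "")); // reconnect, node kept
        System.out.println(zk.removedEvents);
    }
}
```

   With the early return, a reconnect within the same session leaves the node untouched and produces no removal, while a node lost to session expiry is still re-created on the next call. The sketch does not model the case where the existing node belongs to an older session, which a real change may need to consider.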


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org