
Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2020/10/28 03:28:26 UTC

[GitHub] [incubator-dolphinscheduler] rockxsj opened a new issue #4003: [Bug][master&worker] zk node should remove when the worker node is removed

rockxsj opened a new issue #4003:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4003


   
   **Describe the bug**
   When I delete some worker pods from the k8s cluster, the master and worker nodes keep querying the deleted node and print error-level logs.
   I think a better solution is to stop checking the node in zk after, say, five consecutive check failures.
   
   **To Reproduce**
   Steps to reproduce the behavior, for example:
   1. Run two master pods and two worker pods.
   2. Delete one or two worker pods.
   
   **Expected behavior**
   The master and worker servers check whether the zk node exists up to five times; if every check fails, they stop checking this node.
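
The expected behavior above could be sketched roughly as follows. This is only an illustration: `NodeFailureTracker` and its methods are hypothetical names, not DolphinScheduler code, and the limit of five comes from the description above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of DolphinScheduler): counts consecutive failed checks per node. */
final class NodeFailureTracker {
    private final int maxFailures;
    private final Map<String, Integer> consecutiveFailures = new ConcurrentHashMap<>();

    NodeFailureTracker(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be at least 1");
        }
        this.maxFailures = maxFailures;
    }

    /** Records one failed check; returns true once the node has reached the failure limit. */
    boolean recordFailure(String nodePath) {
        int count = consecutiveFailures.merge(nodePath, 1, Integer::sum);
        return count >= maxFailures;
    }

    /** A successful check resets the counter, so only consecutive failures count. */
    void recordSuccess(String nodePath) {
        consecutiveFailures.remove(nodePath);
    }

    /** Returns false for nodes that have already failed maxFailures times in a row. */
    boolean shouldCheck(String nodePath) {
        return consecutiveFailures.getOrDefault(nodePath, 0) < maxFailures;
    }
}
```

A refresh task would call `shouldCheck` before reading a node, `recordFailure` when the read fails, and `recordSuccess` otherwise. Whether a skipped node should ever be retried is a design choice this sketch leaves open.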
   
   **Screenshots**
   `
   [ERROR] 2020-10-28 11:21:21.284 org.apache.dolphinscheduler.service.zk.ZookeeperOperator:[122] - get key : /data/ds/nodes/worker/default/10.122.84.49:1234
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /data/ds/nodes/worker/default/10.122.84.49:1234
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
       at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1221)
       at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
       at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
       at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:67)
       at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:81)
       at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
       at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
       at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
       at org.apache.dolphinscheduler.service.zk.ZookeeperOperator.get(ZookeeperOperator.java:120)
       at org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager$RefreshResourceTask.run(LowerWeightHostManager.java:152)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
   `
   
   
   **Which version of Dolphin Scheduler:**
    -[1.3.2-release]
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-dolphinscheduler] CalvinKirs commented on issue #4003: [Bug][master&worker] zk node check should stop when continuous several times check failure

Posted by GitBox <gi...@apache.org>.

CalvinKirs commented on issue #4003:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4003#issuecomment-742413056


   Sorry, please ignore my comment; this may not be a problem.



[GitHub] [incubator-dolphinscheduler] CalvinKirs removed a comment on issue #4003: [Bug][master&worker] zk node check should stop when continuous several times check failure

Posted by GitBox <gi...@apache.org>.

CalvinKirs removed a comment on issue #4003:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4003#issuecomment-731579735


   Hi, let me confirm: did you mean that you deleted the worker node manually when taking it offline, and that this did not trigger the worker's offline report, so zk still considers the worker alive?
   
   Also, I agree with you: the heartbeat check should remove the information for such nodes, otherwise the heartbeat check is meaningless.



[GitHub] [incubator-dolphinscheduler] CalvinKirs commented on issue #4003: [Bug][master&worker] zk node check should stop when continuous several times check failure

Posted by GitBox <gi...@apache.org>.

CalvinKirs commented on issue #4003:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4003#issuecomment-773980907


   > @rockxsj @CalvinKirs Has this issue been fixed?
   > I tried to reproduce the problem: I started 3 masters and 3 workers, then killed all the workers, on the latest 1.3.5-prepare branch.
   > But I got the following logs:
   > ![image](https://user-images.githubusercontent.com/4902714/107028588-b879dd80-67e8-11eb-804e-87ec0cfeda2b.png)
   > 
   > The behavior you mentioned (querying the deleted node and printing error-level logs) doesn't appear.
   > Everything looks normal. Does that mean this problem no longer exists?
   
   I think this is not a problem. We may need more information to help users troubleshoot.



[GitHub] [incubator-dolphinscheduler] chengshiwen closed issue #4003: [Bug][master&worker] zk node check should stop when continuous several times check failure

Posted by GitBox <gi...@apache.org>.

chengshiwen closed issue #4003:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4003


   



[GitHub] [incubator-dolphinscheduler] rockxsj commented on issue #4003: [Bug][master&worker] zk node check should stop when continuous several times check failure

Posted by GitBox <gi...@apache.org>.

rockxsj commented on issue #4003:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4003#issuecomment-717770036


   ![image](https://user-images.githubusercontent.com/3021207/97408929-c8f00180-1937-11eb-8f4f-d38063398b50.png)
   Maybe when the zk node does not exist, we should remove this node from the set?
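
A minimal sketch of that idea, under stated assumptions: `WorkerNodeSet` is a hypothetical class, and the `lookup` function stands in for the ZooKeeper read (an empty result plays the role of the `NoNode` case). It is not the actual `LowerWeightHostManager` code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: a set of worker node paths that drops entries whose zk node is gone. */
final class WorkerNodeSet {
    private final Set<String> nodes = ConcurrentHashMap.newKeySet();
    // Stand-in for the ZooKeeper read; Optional.empty() means the node does not exist.
    private final Function<String, Optional<String>> lookup;

    WorkerNodeSet(Function<String, Optional<String>> lookup) {
        this.lookup = lookup;
    }

    void add(String nodePath) {
        nodes.add(nodePath);
    }

    /** Returns a read-only snapshot of the paths currently tracked. */
    Set<String> nodes() {
        return Collections.unmodifiableSet(new HashSet<>(nodes));
    }

    /** Reads every tracked node, removes the ones that no longer exist, and returns the rest with their data. */
    Map<String, String> refresh() {
        Map<String, String> data = new HashMap<>();
        nodes.removeIf(path -> {
            Optional<String> value = lookup.apply(path);
            value.ifPresent(v -> data.put(path, v));
            return !value.isPresent();
        });
        return data;
    }
}
```

With this shape a deleted worker is queried at most once more after it disappears, so the error is not logged on every refresh.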



[GitHub] [incubator-dolphinscheduler] CalvinKirs commented on issue #4003: [Bug][master&worker] zk node check should stop when continuous several times check failure

Posted by GitBox <gi...@apache.org>.

CalvinKirs commented on issue #4003:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4003#issuecomment-731579735


   Hi, let me confirm: did you mean that you deleted the worker node manually when taking it offline, and that this did not trigger the worker's offline report, so zk still considers the worker alive?
   
   Also, I agree with you: the heartbeat check should remove the information for such nodes, otherwise the heartbeat check is meaningless.



[GitHub] [incubator-dolphinscheduler] chengshiwen commented on issue #4003: [Bug][master&worker] zk node check should stop when continuous several times check failure

Posted by GitBox <gi...@apache.org>.

chengshiwen commented on issue #4003:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4003#issuecomment-773979589


   @rockxsj @CalvinKirs Has this issue been fixed?
   I tried to reproduce the problem: I started 3 masters and 3 workers, then killed all the workers, on the latest 1.3.5-prepare branch.
   But I got the following logs:
   ![image](https://user-images.githubusercontent.com/4902714/107028588-b879dd80-67e8-11eb-804e-87ec0cfeda2b.png)
   
   The behavior you mentioned (querying the deleted node and printing error-level logs) doesn't appear.
   Everything looks normal. Does that mean this problem no longer exists?
   



[GitHub] [incubator-dolphinscheduler] chengshiwen commented on issue #4003: [Bug][master&worker] zk node check should stop when continuous several times check failure

Posted by GitBox <gi...@apache.org>.

chengshiwen commented on issue #4003:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4003#issuecomment-817110660


   Closing this issue since there have been no updates for a long time.



[GitHub] [incubator-dolphinscheduler] CalvinKirs commented on issue #4003: [Bug][master&worker] zk node check should stop when continuous several times check failure

Posted by GitBox <gi...@apache.org>.

CalvinKirs commented on issue #4003:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4003#issuecomment-732191101


   Do you have any different opinions on this? If there is no problem, I will submit a PR to fix it.

