Posted to notifications@shardingsphere.apache.org by GitBox <gi...@apache.org> on 2022/01/14 11:09:52 UTC

[GitHub] [shardingsphere-elasticjob] RedRed228 commented on issue #2038: OneOffJob cannot be triggered on some instances occasionally

RedRed228 commented on issue #2038:
URL: https://github.com/apache/shardingsphere-elasticjob/issues/2038#issuecomment-1013025677


   Sometimes the issue looks like this: when we triggered the job, no shard was triggered at all. 
   I didn't get the dump file, but I checked the ZooKeeper nodes directly when the problem occurred. From the ZooKeeper data I saw that the resharding-necessary flag was set and the leader was elected, but the leader did not do the resharding.
   The leader instance has many debug logs:
   ![Leader node debug log while the job was stuck after triggering](https://user-images.githubusercontent.com/16835994/149505567-e1735c0e-639a-40fc-ba53-8e6a44b34b5a.jpg)
   The server node for that ip belongs to a server we had deleted ourselves. Deleting that node is the cause of the issue; I will explain later why we deleted the server node.
   The follower instances also have many debug logs:
   ![Follower node debug log while the job was stuck after triggering](https://user-images.githubusercontent.com/16835994/149505685-a86d35f6-4ace-49e0-9f74-486382f01b0e.jpg)
   From the debug logs we can infer that the leader did not do the resharding, so I dumped the leader's stack:
   ![Endless-loop stack trace of the Curator-SafeNotifyService-0 thread](https://user-images.githubusercontent.com/16835994/149505718-7c050d5c-dde0-430b-b6f4-0e0e3cb65e40.jpg)
   I found that the "Curator-SafeNotifyService-0" thread was always stuck here, spinning in an endless loop in ServerService.isEnableServer():
    public boolean isEnableServer(final String ip) {
        String serverStatus = jobNodeStorage.getJobNodeData(serverNode.getServerNode(ip));
        // If the server node for this ip no longer exists, getJobNodeData keeps returning an
        // empty value and this loop never exits.
        while (Strings.isNullOrEmpty(serverStatus)) {
            BlockUtils.waitingShortTime();
            serverStatus = jobNodeStorage.getJobNodeData(serverNode.getServerNode(ip));
        }
        return !ServerStatus.DISABLED.name().equals(serverStatus);
    }
   
   Reason found:
   Our application runs on a cloud-native architecture, and we found that the children of the servers node kept growing as containers (pods) were destroyed and recreated over and over. So I registered a shutdownHook when the container is created, which removes the instance's server node under the servers node from ZooKeeper when the container is destroyed, by calling JobOperateAPI's remove method.
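   The hook is roughly like the sketch below. The class name, ZooKeeper address, namespace, job name and local ip are placeholders, and the use of JobAPIFactory is my assumption about wiring up the elasticjob-lite-lifecycle module; only the JobOperateAPI.remove call itself is what we actually rely on:
    // Hypothetical sketch of the shutdown hook: remove this instance's server node on container shutdown.
    import org.apache.shardingsphere.elasticjob.lite.lifecycle.api.JobAPIFactory;
    import org.apache.shardingsphere.elasticjob.lite.lifecycle.api.JobOperateAPI;

    public final class ServerNodeCleanupHook {

        public static void register(final String zkAddress, final String namespace, final String jobName, final String localIp) {
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                JobOperateAPI jobOperateAPI = JobAPIFactory.createJobOperateAPI(zkAddress, namespace, null);
                // Delete this instance's node under servers/ so dead pods do not pile up in ZooKeeper.
                jobOperateAPI.remove(jobName, localIp);
            }));
        }
    }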
   This caused the issue: when a container is destroyed, its instance goes down and JobCrashedJobListener is triggered on the remaining instances, and that listener falls into the endless loop shown above if the crashed instance's server node has already been deleted.
   If the leader node's "Curator-SafeNotifyService-0" thread falls into the endless loop, no shard is triggered at all, because only the leader can do resharding and all listeners are dispatched on that same thread (which may be another issue in itself). If the leader does not fall into the endless loop, resharding succeeds and we see that some shards are triggered successfully while those whose instances are stuck in the endless loop are not, which matches this issue exactly.
   We still need to solve the problem of server nodes piling up under the servers node, so before deleting the server node we now sleep longer than the sessionTimeout, to make sure all alive instances have finished JobCrashedJobListener and their "Curator-SafeNotifyService-0" threads cannot fall into the endless loop.
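   Concretely, the adjusted hook is just the earlier sketch with a delay before the remove call. This is again only a sketch; sessionTimeoutMillis and the extra margin are placeholders for whatever the registry center is configured with:
    // Hypothetical variant of the hook above: wait longer than the ZooKeeper session timeout before
    // removing the server node, so surviving instances can finish JobCrashedJobListener first.
    public static void registerDelayedCleanup(final String zkAddress, final String namespace, final String jobName,
                                              final String localIp, final long sessionTimeoutMillis) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                // Sleep a bit longer than the session timeout; the 5-second margin is an arbitrary choice.
                Thread.sleep(sessionTimeoutMillis + 5000L);
            } catch (final InterruptedException ex) {
                Thread.currentThread().interrupt();
                return;
            }
            JobAPIFactory.createJobOperateAPI(zkAddress, namespace, null).remove(jobName, localIp);
        }));
    }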
   I'm also not sure whether we could call addListener(listener, executor) instead of addListener(listener) when registering listeners in JobNodeStorage's addDataListener method:
    public void addDataListener(final CuratorCacheListener listener) {
        CuratorCache cache = (CuratorCache) regCenter.getRawCache("/" + jobName);
        cache.listenable().addListener(listener);
    }
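   For illustration, an overload that accepts an executor could look like the sketch below. The overload itself is hypothetical, but Curator's Listenable does provide addListener(listener, executor):
    // Hypothetical overload: dispatch this listener's events on a caller-supplied executor
    // (java.util.concurrent.Executor) instead of Curator's shared notification thread,
    // so one blocked listener cannot stall the others.
    public void addDataListener(final CuratorCacheListener listener, final Executor executor) {
        CuratorCache cache = (CuratorCache) regCenter.getRawCache("/" + jobName);
        cache.listenable().addListener(listener, executor);
    }
   Each job (or each listener) could then be given its own executor, so a listener that blocks, like the isEnableServer loop above, would only stall itself rather than every notification for the job.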
   

