Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2022/12/22 01:23:53 UTC

[GitHub] [dolphinscheduler] wcmolin opened a new issue, #13247: [Bug] [Master] Kill yarn job failed due to NPE exception

wcmolin opened a new issue, #13247:
URL: https://github.com/apache/dolphinscheduler/issues/13247

   ### Search before asking
   
   - [X] I have searched the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### What happened
   
   When a worker node is stopped, the master's fault-tolerance (failover) thread throws a NullPointerException (NPE) as it starts. I think the problematic code is at line 482 of
   `org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient`:
   ```
   TaskExecutionContext taskExecutionContext = TaskExecutionContextBuilder.get()
           .buildTaskInstanceRelatedInfo(taskInstance)
           .buildProcessInstanceRelatedInfo(processInstance)
           .create();
   ```
   The `processDefineCode` and `processDefineVersion` of `taskInstance` are never assigned here.
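
   A minimal sketch of how an unassigned field could lead to this kind of NPE, assuming the failure comes from auto-unboxing a `null` boxed value. Every class and method name below is a hypothetical stand-in for illustration, not the real DolphinScheduler code:

   ```java
   import java.util.Objects;

   // Simplified stand-ins for illustration only; these are NOT the real DolphinScheduler classes.
   public class FailoverContextSketch {

       // Hypothetical task record; boxed types so that "unset" shows up as null.
       static class TaskInstance {
           int id;
           Long processDefineCode;
           Integer processDefineVersion;
       }

       // Hypothetical execution context built from a task instance.
       static class Context {
           int taskInstanceId;
           Long processDefineCode;
           Integer processDefineVersion;
       }

       // Copies the task's fields into a context without checking them.
       static Context build(TaskInstance task) {
           Context ctx = new Context();
           ctx.taskInstanceId = task.id;
           ctx.processDefineCode = task.processDefineCode;
           ctx.processDefineVersion = task.processDefineVersion;
           return ctx;
       }

       // Builds a log path from primitive values; unboxing a null field throws NullPointerException.
       static String logPath(Context ctx) {
           long code = ctx.processDefineCode;      // NPE here if the field was never assigned
           int version = ctx.processDefineVersion; // same for the version
           return code + "_" + version + "/" + ctx.taskInstanceId + ".log";
       }

       // Guarded variant: still throws, but with a message that names the missing field.
       static String logPathChecked(Context ctx) {
           Objects.requireNonNull(ctx.processDefineCode, "processDefineCode is not set");
           Objects.requireNonNull(ctx.processDefineVersion, "processDefineVersion is not set");
           return logPath(ctx);
       }

       public static void main(String[] args) {
           TaskInstance task = new TaskInstance();
           task.id = 416;
           try {
               logPath(build(task));
           } catch (NullPointerException e) {
               System.out.println("NPE because the definition fields were never assigned");
           }
           // Assigning the fields before building the context avoids the NPE.
           task.processDefineCode = 1L;
           task.processDefineVersion = 2;
           System.out.println(logPath(build(task)));
       }
   }
   ```

   In this sketch, either assigning the two fields before the context is built or checking them before use avoids the bare NPE; which of these suits the real code path is not established here.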
   
   log:
   ```
   
   [INFO] 2022-12-22 09:14:13.969 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[239] - worker group node : /nodes/worker/default/10.66.76.129:1234 down.
   [INFO] 2022-12-22 09:14:13.970 org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener:[80] - worker node deleted : /nodes/worker/default/10.66.76.129:1234
   [INFO] 2022-12-22 09:14:13.974 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[195] - WORKER node deleted : /nodes/worker/default/10.66.76.129:1234
   [INFO] 2022-12-22 09:14:13.978 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[205] - path: /nodes/worker/default/10.66.76.129:1234 not exists
   [INFO] 2022-12-22 09:14:14.035 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[377] - start worker[10.66.76.129:1234] failover, task list size:3
   [INFO] 2022-12-22 09:14:14.040 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[400] - failover task instance id: 416, process instance id: 231
   [ERROR] 2022-12-22 09:14:15.070 org.apache.dolphinscheduler.server.utils.ProcessUtils:[211] - kill yarn job failure
   java.lang.NullPointerException: null
   	at org.apache.dolphinscheduler.server.utils.ProcessUtils.killYarnJob(ProcessUtils.java:197)
   	at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverTaskInstance(MasterRegistryClient.java:496)
   	at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverWorker(MasterRegistryClient.java:401)
   	at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverServerWhenDown(MasterRegistryClient.java:231)
   	at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.removeWorkerNodePath(MasterRegistryClient.java:212)
   	at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.handleWorkerEvent(MasterRegistryDataListener.java:81)
   	at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.notify(MasterRegistryDataListener.java:55)
   	at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:127)
   	at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
   	at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
   	at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
   	at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
   	at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
   	at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
   	at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
   	at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   	at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
   	at java.util.concurrent.FutureTask.run(FutureTask.java)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   	at java.lang.Thread.run(Thread.java:748)
   [INFO] 2022-12-22 09:14:15.147 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[504] - workflowExecuteThreadNotify is null, just return, task id:416,process id:231
   ```
   
   ### What you expected to happen
   
   No NPE should be thrown when the master fails over tasks from a stopped worker.
   
   ### How to reproduce
   
   Create a task that requires fault tolerance, then stop the worker server.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   2.0.x
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #13247: [Bug] [Master] Kill yarn job failed due to NPE exception

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #13247:
URL: https://github.com/apache/dolphinscheduler/issues/13247#issuecomment-1362282104

   Thank you for your feedback. We have received your issue; please wait patiently for a reply.
   * So that we can understand your request as soon as possible, please provide detailed information, the version, or pictures.
   * If you haven't received a reply for a long time, you can [join our slack](https://s.apache.org/dolphinscheduler-slack) and send your question to the `#troubleshooting` channel.




[GitHub] [dolphinscheduler] JinyLeeChina closed issue #13247: [Bug] [Master] Kill yarn job failed due to NPE exception

Posted by GitBox <gi...@apache.org>.
JinyLeeChina closed issue #13247: [Bug] [Master] Kill yarn job failed due to NPE exception
URL: https://github.com/apache/dolphinscheduler/issues/13247

