You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2022/12/30 15:48:15 UTC

[GitHub] [dolphinscheduler] hokie-chan opened a new issue, #13310: [Bug] [Worker] When registry recover from SUSPENDED to RECONNECTED, the worker will crash

hokie-chan opened a new issue, #13310:
URL: https://github.com/apache/dolphinscheduler/issues/13310

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### What happened
   
   When worker registry (zookeeper) client recover from SUSPENDED to RECONNECTED, the worker itself will try to restart, but actually the worker is still normally running, then will cause "org.apache.dolphinscheduler.remote.exceptions.RemoteException: NettyRemotingServer bind 1234 fail", and finally cause the worker node hang.
   Please refer to "Anything else" section for logs.
   
   ### What you expected to happen
   
   When worker registry (zookeeper) client recover from SUSPENDED to RECONNECTED, the worker itself no need to restart, just do nothing, and of course, will not crash.
   
   ### How to reproduce
   
   Block the network connection from worker node to zookeeper server 2181 port, when we can see the "client timedout" log, the worker registry client will change to SUSPENDED status, 
   then resume the network connection, then the worker registry client will reconnect to zookeeper server, and status will change to RECONNECTED. Now we can see the exception happen.
   
   ### Anything else
   
   ```  
   [INFO] 2022-12-26 11:06:35.672 +0800 org.apache.dolphinscheduler.server.worker.task.WorkerHeartBeatTask:[89] - [WorkflowInstance-0][TaskInstance-0] - Success write worker group heartBeatInfo into registry, workerRegistryPath: /nodes/worker/10.x.x.2:1234 workerHeartBeatInfo: {"startupTime":1671855451894,"reportTime":1672023995663,"cpuUsage":0.01,"memoryUsage":0.72,"loadAverage":0.0,"availablePhysicalMemorySize":2.14,"maxCpuloadAvg":4.0,"reservedMemory":0.03,"diskAvailable":28.47,"serverStatus":0,"processId":5576,"workerHostWeight":100,"workerWaitingTaskCount":0,"workerExecThreadCount":100}
   [WARN] 2022-12-26 11:06:55.680 +0800 org.apache.zookeeper.ClientCnxn:[1249] - [WorkflowInstance-0][TaskInstance-0] - Client session timed out, have not heard from server in 20008ms for session id 0x1028fe2bbbc003f
   [WARN] 2022-12-26 11:06:55.683 +0800 org.apache.zookeeper.ClientCnxn:[1292] - [WorkflowInstance-0][TaskInstance-0] - Session 0x1028fe2bbbc003f for server ds-1/10.x.x.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
   org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 20008ms for session id 0x1028fe2bbbc003f
   	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1250)
   [INFO] 2022-12-26 11:06:55.785 +0800 org.apache.curator.framework.state.ConnectionStateManager:[252] - [WorkflowInstance-0][TaskInstance-0] - State change: SUSPENDED
   [WARN] 2022-12-26 11:06:55.786 +0800 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[52] - [WorkflowInstance-0][TaskInstance-0] - Registry suspended
   [INFO] 2022-12-26 11:06:55.787 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener:[46] - [WorkflowInstance-0][TaskInstance-0] - Worker received a SUSPENDED event from registry, the current server state is RUNNING
   [INFO] 2022-12-26 11:06:57.539 +0800 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server ds-1/10.x.x.1:2181.
   [INFO] 2022-12-26 11:06:57.540 +0800 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error)
   [INFO] 2022-12-26 11:06:57.543 +0800 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /10.x.x.2:57590, server: ds-1/10.x.x.1:2181
   [INFO] 2022-12-26 11:06:57.567 +0800 org.apache.zookeeper.ClientCnxn:[1444] - [WorkflowInstance-0][TaskInstance-0] - Session establishment complete on server ds-1/10.x.x.1:2181, session id = 0x1028fe2bbbc003f, negotiated timeout = 30000
   [INFO] 2022-12-26 11:06:57.568 +0800 org.apache.curator.framework.state.ConnectionStateManager:[252] - [WorkflowInstance-0][TaskInstance-0] - State change: RECONNECTED
   [INFO] 2022-12-26 11:06:57.568 +0800 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[48] - [WorkflowInstance-0][TaskInstance-0] - Registry reconnected
   [INFO] 2022-12-26 11:06:57.568 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener:[46] - [WorkflowInstance-0][TaskInstance-0] - Worker received a RECONNECTED event from registry, the current server state is RUNNING
   [WARN] 2022-12-26 11:06:57.569 +0800 org.apache.dolphinscheduler.common.lifecycle.ServerLifeCycleManager:[73] - [WorkflowInstance-0][TaskInstance-0] - The current server status is already running, cannot recover form waiting
   [INFO] 2022-12-26 11:06:57.569 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[75] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server starting
   [ERROR] 2022-12-26 11:06:57.573 +0800 org.apache.dolphinscheduler.remote.NettyRemotingServer:[143] - [WorkflowInstance-0][TaskInstance-0] - NettyRemotingServer bind fail bind(..) failed: Address already in use, exit
   io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address already in use
   [ERROR] 2022-12-26 11:06:57.573 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[104] - [WorkflowInstance-0][TaskInstance-0] - Recover from waiting failed, the current server status is RUNNING, will stop the server
   org.apache.dolphinscheduler.remote.exceptions.RemoteException: NettyRemotingServer bind 1234 fail
   	at org.apache.dolphinscheduler.remote.NettyRemotingServer.start(NettyRemotingServer.java:144)
   	at org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer.start(WorkerRpcServer.java:93)
   	at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.reStartWorkerResource(WorkerWaitingStrategy.java:130)
   	at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.reconnect(WorkerWaitingStrategy.java:97)
   	at org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener.onUpdate(WorkerConnectionStateListener.java:54)
   	at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:49)
   	at org.apache.curator.framework.state.ConnectionStateManager.lambda$processEvents$0(ConnectionStateManager.java:281)
   	at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92)
   	at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89)
   	at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89)
   	at org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:281)
   	at org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
   	at org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:134)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   [INFO] 2022-12-26 11:06:57.577 +0800 org.apache.dolphinscheduler.server.worker.task.WorkerHeartBeatTask:[89] - [WorkflowInstance-0][TaskInstance-0] - Success write worker group heartBeatInfo into registry, workerRegistryPath: /nodes/worker/10.x.x.2:1234 workerHeartBeatInfo: {"startupTime":1671855451894,"reportTime":1672024005673,"cpuUsage":0.01,"memoryUsage":0.72,"loadAverage":0.0,"availablePhysicalMemorySize":2.14,"maxCpuloadAvg":4.0,"reservedMemory":0.03,"diskAvailable":28.47,"serverStatus":0,"processId":5576,"workerHostWeight":100,"workerWaitingTaskCount":0,"workerExecThreadCount":100}
   [INFO] 2022-12-26 11:07:00.573 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[148] - [WorkflowInstance-0][TaskInstance-0] - Worker server is stopping, current cause : Recover from waiting failed, the current server status is RUNNING, will stop the server
   [INFO] 2022-12-26 11:07:00.594 +0800 org.eclipse.jetty.server.AbstractConnector:[383] - [WorkflowInstance-0][TaskInstance-0] - Stopped ServerConnector@1682e6a{HTTP/1.1, (http/1.1)}{0.0.0.0:1235}
   [INFO] 2022-12-26 11:07:00.595 +0800 org.eclipse.jetty.server.session:[149] - [WorkflowInstance-0][TaskInstance-0] - node0 Stopped scavenging
   [INFO] 2022-12-26 11:07:00.596 +0800 org.eclipse.jetty.server.handler.ContextHandler.application:[2368] - [WorkflowInstance-0][TaskInstance-0] - Destroying Spring FrameworkServlet 'dispatcherServlet'
   [INFO] 2022-12-26 11:07:00.597 +0800 org.eclipse.jetty.server.handler.ContextHandler:[1159] - [WorkflowInstance-0][TaskInstance-0] - Stopped o.s.b.w.e.j.JettyEmbeddedWebAppContext@6abb7b7d{application,/,[file:///tmp/jetty-docbase.1235.4335247786354365438/],STOPPED}
   [WARN] 2022-12-26 11:07:00.605 +0800 org.apache.dolphinscheduler.common.model.BaseHeartBeatTask:[69] - [WorkflowInstance-0][TaskInstance-0] - WorkerHeartBeatTask task finished
   [INFO] 2022-12-26 11:07:00.607 +0800 org.apache.curator.framework.imps.CuratorFrameworkImpl:[998] - [WorkflowInstance-0][TaskInstance-0] - backgroundOperationsLoop exiting
   [INFO] 2022-12-26 11:07:00.721 +0800 org.apache.zookeeper.ZooKeeper:[1232] - [WorkflowInstance-0][TaskInstance-0] - Session: 0x1028fe2bbbc003f closed
   [INFO] 2022-12-26 11:07:00.721 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient:[111] - [WorkflowInstance-0][TaskInstance-0] - Worker registry client closed
   [INFO] 2022-12-26 11:07:00.722 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[99] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closing
   [INFO] 2022-12-26 11:07:00.726 +0800 org.apache.dolphinscheduler.remote.NettyRemotingServer:[212] - [WorkflowInstance-0][TaskInstance-0] - netty server closed
   [INFO] 2022-12-26 11:07:00.726 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[101] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closed
   [INFO] 2022-12-26 11:07:00.729 +0800 org.apache.zookeeper.ClientCnxn:[568] - [WorkflowInstance-0][TaskInstance-0] - EventThread shut down for session: 0x1028fe2bbbc003f
   [INFO] 2022-12-26 11:07:00.730 +0800 org.apache.dolphinscheduler.remote.NettyRemotingClient:[334] - [WorkflowInstance-0][TaskInstance-0] - netty client closed
   [INFO] 2022-12-26 11:07:00.730 +0800 org.apache.dolphinscheduler.service.log.LogClient:[229] - [WorkflowInstance-0][TaskInstance-0] - LogClientService closed
   [INFO] 2022-12-26 11:07:00.730 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcClient:[71] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc client closing
   [INFO] 2022-12-26 11:07:00.733 +0800 org.apache.dolphinscheduler.remote.NettyRemotingClient:[334] - [WorkflowInstance-0][TaskInstance-0] - netty client closed
   [INFO] 2022-12-26 11:07:00.733 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcClient:[73] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc client closed
   [WARN] 2022-12-26 11:07:00.733 +0800 org.apache.dolphinscheduler.service.alert.AlertClientService:[76] - [WorkflowInstance-0][TaskInstance-0] - Alert client is already closed
   [INFO] 2022-12-26 11:07:00.735 +0800 com.zaxxer.hikari.HikariDataSource:[350] - [WorkflowInstance-0][TaskInstance-0] - DolphinScheduler - Shutdown initiated...
   [INFO] 2022-12-26 11:07:00.745 +0800 com.zaxxer.hikari.HikariDataSource:[352] - [WorkflowInstance-0][TaskInstance-0] - DolphinScheduler - Shutdown completed.
   [INFO] 2022-12-26 11:07:00.747 +0800 org.apache.dolphinscheduler.service.alert.AlertClientService:[80] - [WorkflowInstance-0][TaskInstance-0] - Alter client closing
   [INFO] 2022-12-26 11:07:00.748 +0800 org.apache.dolphinscheduler.remote.NettyRemotingClient:[334] - [WorkflowInstance-0][TaskInstance-0] - netty client closed
   [INFO] 2022-12-26 11:07:00.748 +0800 org.apache.dolphinscheduler.service.alert.AlertClientService:[82] - [WorkflowInstance-0][TaskInstance-0] - Alter client closed
   [WARN] 2022-12-26 11:07:00.748 +0800 org.apache.dolphinscheduler.common.model.BaseHeartBeatTask:[69] - [WorkflowInstance-0][TaskInstance-0] - WorkerHeartBeatTask task finished
   [INFO] 2022-12-26 11:07:00.749 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient:[111] - [WorkflowInstance-0][TaskInstance-0] - Worker registry client closed
   [INFO] 2022-12-26 11:07:00.749 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[99] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closing
   [INFO] 2022-12-26 11:07:00.749 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[101] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closed
   [INFO] 2022-12-26 11:07:00.749 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[155] - [WorkflowInstance-0][TaskInstance-0] - Worker server stopped, current cause: Recover from waiting failed, the current server status is RUNNING, will stop the server
   ```
   
   ### Version
   
   3.1.x
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #13310: [Bug] [Worker] When registry recover from SUSPENDED to RECONNECTED, the worker will crash

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #13310:
URL: https://github.com/apache/dolphinscheduler/issues/13310#issuecomment-1367980288

   Thank you for your feedback, we have received your issue, Please wait patiently for a reply.
   * In order for us to understand your request as soon as possible, please provide detailed information、version or pictures.
   * If you haven't received a reply for a long time, you can [join our slack](https://s.apache.org/dolphinscheduler-slack) and send your question to channel `#troubleshooting`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] xiaoyong513 commented on issue #13310: [Bug] [Worker] When registry recover from SUSPENDED to RECONNECTED, the worker will crash

Posted by "xiaoyong513 (via GitHub)" <gi...@apache.org>.
xiaoyong513 commented on issue #13310:
URL: https://github.com/apache/dolphinscheduler/issues/13310#issuecomment-1631800003

   In the startup process, I noticed that the log in the "work" node contains the error message "NettyRemotingServer bind 1234 fail," while the log in the "master" node does not have the aforementioned error message.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] ruanwenjun closed issue #13310: [Bug] [Worker] When registry recover from SUSPENDED to RECONNECTED, the worker will crash

Posted by GitBox <gi...@apache.org>.
ruanwenjun closed issue #13310: [Bug] [Worker] When registry recover from SUSPENDED to RECONNECTED, the worker will crash
URL: https://github.com/apache/dolphinscheduler/issues/13310


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org