You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2022/12/27 10:21:46 UTC

[GitHub] [dolphinscheduler] KEN-LJQ opened a new issue, #13283: [Bug] [Master] Worker failover cause task repeat run

KEN-LJQ opened a new issue, #13283:
URL: https://github.com/apache/dolphinscheduler/issues/13283

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### What happened
   
   We have a long run mapreduce task, which will duration a few hours, like 12hours. After we doing worker and master rolling restart, this mr task keep submit to yarn every 10 mins, and the old one failed to kill as we found the error log.
   Finally, this task repeat submit many mr instance cause out yarn cluster run out of resources.
   
   Our version is 2.0.7
   After, we diging into the code, we found something about the worker failover within `MasterRegistryClient`
   ```
   private boolean checkTaskAfterWorkerStart(List<Server> workerServers, TaskInstance taskInstance) {
           if (StringUtils.isEmpty(taskInstance.getHost())) {
               return false;
           }
   
           Date taskTime = taskInstance.getStartTime() == null ? taskInstance.getSubmitTime() : taskInstance.getStartTime();
   
           Date workerServerStartDate = getServerStartupTime(workerServers, taskInstance.getHost());
           if (workerServerStartDate != null) {
               return taskTime.after(workerServerStartDate);
           }
           return false;
       }
   ```
   The failover code use the `start_time`  of task intance,  instead of `submit_time`. And `start_time` will not update every `NEED_FAULT_TOLERANCE`, the `submit_time` is the updated one. Which cause task submit again and again, because this mr task will not finsih within the failover check interval, which is 10min
   
   And here is the code in version 3.x
   ```
   // The worker is active, may already send some new task to it
           if (taskInstance.getSubmitTime() != null && taskInstance.getSubmitTime()
               .after(needFailoverWorkerStartTime.get())) {
               LOGGER.info(
                   "The taskInstance's submitTime: {} is after the need failover worker's start time: {}, the taskInstance is newly submit, it doesn't need to failover",
                   taskInstance.getSubmitTime(),
                   needFailoverWorkerStartTime.get());
               return false;
           }
   ```
   
   ### What you expected to happen
   
   Worker failover should use `submit_time` to judge
   
   And also, we wanna to know that, if there is a hive sql task or shell task, and when worker failover happen, will this sql task repeat run, because we can not found the code about kill the hive sql. If not, may it be a problem?
   
   ### How to reproduce
   
   1. create process with a mr task, which run longer than the failover check interval
   2. submit a process instance
   3. restart worker
   4. check how many mr task submit on yarn
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   2.0.x
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] wcmolin commented on issue #13283: [Bug] [Master] Worker failover cause task repeat run

Posted by GitBox <gi...@apache.org>.
wcmolin commented on issue #13283:
URL: https://github.com/apache/dolphinscheduler/issues/13283#issuecomment-1384788473

   This seems to be this issue https://github.com/apache/dolphinscheduler/issues/12482, it has been fixed, can you use the 2.0.8 branch to test it. @KEN-LJQ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #13283: [Bug] [Master] Worker failover cause task repeat run

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #13283:
URL: https://github.com/apache/dolphinscheduler/issues/13283#issuecomment-1365788238

   Thank you for your feedback, we have received your issue, Please wait patiently for a reply.
   * In order for us to understand your request as soon as possible, please provide detailed information、version or pictures.
   * If you haven't received a reply for a long time, you can [join our slack](https://s.apache.org/dolphinscheduler-slack) and send your question to channel `#troubleshooting`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] JinyLeeChina commented on issue #13283: [Bug] [Master] Worker failover cause task repeat run

Posted by GitBox <gi...@apache.org>.
JinyLeeChina commented on issue #13283:
URL: https://github.com/apache/dolphinscheduler/issues/13283#issuecomment-1381602535

   How about the progress of this issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] KEN-LJQ commented on issue #13283: [Bug] [Master] Worker failover cause task repeat run

Posted by GitBox <gi...@apache.org>.
KEN-LJQ commented on issue #13283:
URL: https://github.com/apache/dolphinscheduler/issues/13283#issuecomment-1384946950

   @wcmolin here is the reproduce step with shell task
   1. set master config `master.failover.interval` of `master.properties` to `1`
   2. create shell task with cmd `sleep 2m`
   3. run shell task
   4. while shell task running, restart worker server
   5. this moment shell task should be failover, and check what happend while every master failover check
   
   In this test , shell task should re-run every minute.
   The key step is, make sure the task run longer than master failover interval, that why I set master failover every minute and let shell sleep 2 minute
   
   Maybe you can try with this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] wcmolin commented on issue #13283: [Bug] [Master] Worker failover cause task repeat run

Posted by GitBox <gi...@apache.org>.
wcmolin commented on issue #13283:
URL: https://github.com/apache/dolphinscheduler/issues/13283#issuecomment-1384928881

   > 
   
   Can you teach me to build a shell task to test, I don't have the operating environment of mr task here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] JinyLeeChina closed issue #13283: [Bug] [Master] Worker failover cause task repeat run

Posted by GitBox <gi...@apache.org>.
JinyLeeChina closed issue #13283: [Bug] [Master] Worker failover cause task repeat run
URL: https://github.com/apache/dolphinscheduler/issues/13283


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] wcmolin commented on issue #13283: [Bug] [Master] Worker failover cause task repeat run

Posted by GitBox <gi...@apache.org>.
wcmolin commented on issue #13283:
URL: https://github.com/apache/dolphinscheduler/issues/13283#issuecomment-1384990673

   Sorry, I can't reproduce it. When to restart the worker server? 10 seconds or so at the start of the task? Or more than a minute. If possible, can you chat with me on WeChat: `wcmolin`
   
   here is the master's log:
   ```
   [INFO] 2023-01-17 15:59:28.968 org.apache.dolphinscheduler.server.master.processor.CacheProcessor:[52] - received command : CacheExpireCommand{CacheType=PROCESS_DEFINITION, cacheKey=8256494422080}
   [INFO] 2023-01-17 15:59:28.994 org.apache.dolphinscheduler.server.master.processor.CacheProcessor:[70] - cache evict, type:processDefinition, key:8256494422080
   [INFO] 2023-01-17 15:59:29.451 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[255] - find command 8, slot:0 :
   [INFO] 2023-01-17 15:59:29.453 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[208] - find one command: id: 8, type: START_PROCESS
   [INFO] 2023-01-17 15:59:29.524 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[221] - handle command end, command 8 process 85 start...
   [INFO] 2023-01-17 15:59:29.540 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1372] - add task to stand by list, task name:test_exit, task id:0, task code:8256472748480
   [INFO] 2023-01-17 15:59:29.557 org.apache.dolphinscheduler.service.process.ProcessService:[1100] - start submit task : test_exit, instance id:85, state: RUNNING_EXECUTION
   [INFO] 2023-01-17 15:59:29.597 org.apache.dolphinscheduler.service.process.ProcessService:[1113] - end submit task to db successfully:108 test_exit state:SUBMITTED_SUCCESS complete, instance id:85 state: RUNNING_EXECUTION  
   [INFO] 2023-01-17 15:59:29.611 org.apache.dolphinscheduler.server.master.runner.task.CommonTaskProcessor:[164] - master submit success, task id:108, task name:test_exit, process id:85
   [INFO] 2023-01-17 15:59:29.613 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1386] - remove task from stand by list, id: 108 name:test_exit
   [INFO] 2023-01-17 15:59:30.994 org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] - taskAckCommand : TaskExecuteAckCommand{taskInstanceId=108, startTime=Tue Jan 17 15:59:30 CST 2023, host='10.66.76.129:1234', status=1, logPath='/Users/molin/dev/github/dolphinscheduler/logs/8256494422080_3/85/108.log', executePath='/tmp/dolphinscheduler/exec/process/7960944102080/8256494422080_3/85/108', processInstanceId='85'}
   [INFO] 2023-01-17 15:59:31.957 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[223] - already exists handler process size:0
   [INFO] 2023-01-17 15:59:32.064 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[230] - handle events 85 succeeded.
   [INFO] 2023-01-17 15:59:43.249 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[68] - failover execute started
   [INFO] 2023-01-17 15:59:43.262 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[74] - need failover hosts:[10.66.76.129:5678]
   [INFO] 2023-01-17 15:59:43.317 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[430] - start master[10.66.76.129:5678] failover, process list size:1
   [INFO] 2023-01-17 15:59:43.330 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[462] - master[10.66.76.129:5678] failover end, useTime:37ms
   [INFO] 2023-01-17 15:59:49.665 org.apache.dolphinscheduler.server.master.processor.TaskResponseProcessor:[72] - received command : TaskExecuteResponseCommand{taskInstanceId=108, processInstanceId=85, status=6, startTime=null, endTime=Tue Jan 17 15:59:49 CST 2023, host=null, logPath=null, executePath=null, processId=16894, appIds='', varPool=[]}
   [INFO] 2023-01-17 15:59:50.004 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[223] - already exists handler process size:0
   [INFO] 2023-01-17 15:59:50.061 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[230] - handle events 85 succeeded.
   [INFO] 2023-01-17 15:59:50.983 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] - handle process instance : 85 , events count:1
   [INFO] 2023-01-17 15:59:50.984 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] - already exists handler process size:0
   [INFO] 2023-01-17 15:59:50.986 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[314] - process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus: FAILURE task instance id: 108 process instance id: 85 context: null
   [INFO] 2023-01-17 15:59:50.992 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[428] - work flow 85 task 108 state:FAILURE 
   [INFO] 2023-01-17 15:59:51.031 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1299] - work flow process instance [id: 85, name:test_retry-3-20230117155929472], state change from RUNNING_EXECUTION to FAILURE, cmd type: START_PROCESS
   [INFO] 2023-01-17 15:59:51.048 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[501] - process:85 state FAILURE change to FAILURE
   [INFO] 2023-01-17 15:59:51.150 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[139] - process instance 85 finished.
   [INFO] 2023-01-17 15:59:52.013 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[219] - remove process instance: 85
   [INFO] 2023-01-17 15:59:52.622 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[239] - worker group node : /nodes/worker/default/10.66.76.129:1234 down.
   [INFO] 2023-01-17 15:59:52.622 org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener:[80] - worker node deleted : /nodes/worker/default/10.66.76.129:1234
   [INFO] 2023-01-17 15:59:52.633 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[196] - WORKER node deleted : /nodes/worker/default/10.66.76.129:1234
   [INFO] 2023-01-17 15:59:52.635 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[206] - path: /nodes/worker/default/10.66.76.129:1234 not exists
   [INFO] 2023-01-17 15:59:52.644 org.apache.dolphinscheduler.service.registry.RegistryClient:[164] - WORKER server dead , and /nodes/worker/default/10.66.76.129:1234 added to zk dead server path success
   [INFO] 2023-01-17 15:59:52.694 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[380] - start worker[10.66.76.129:1234] failover, task list size:0
   [INFO] 2023-01-17 15:59:52.695 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[406] - end worker[10.66.76.129:1234] failover, useTime:40ms
   [INFO] 2023-01-17 16:00:43.334 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[68] - failover execute started
   ```
   here is worker log:
   ```
   [INFO] 2023-01-17 15:59:30.624 org.apache.dolphinscheduler.server.worker.processor.TaskExecuteProcessor:[108] - received command : TaskExecuteRequestCommand{taskExecutionContext='{"taskInstanceId":108,"taskName":"test_exit","firstSubmitTime":"2023-01-17 15:59:29","startTime":null,"endTime":null,"taskType":"SHELL","host":null,"executePath":null,"logPath":null,"taskJson":null,"processId":0,"processDefineCode":8256494422080,"processDefineVersion":3,"appIds":null,"processInstanceId":85,"scheduleTime":null,"globalParams":null,"executorId":1,"cmdTypeIfComplement":0,"tenantCode":"molin","queue":"default","projectCode":7960944102080,"taskParams":"{\"resourceList\":[],\"localParams\":[],\"rawScript\":\"sleep 120\",\"dependence\":{},\"conditionResult\":{\"successNode\":[],\"failedNode\":[]},\"waitStartTimeout\":{},\"switchResult\":{}}","envFile":null,"environmentConfig":null,"definedParams":null,"taskAppId":null,"taskTimeoutStrategy":null,"taskTimeout":2147483647,"workerGroup":"default","de
 layTime":0,"currentExecutionStatus":null,"resources":{},"sqlTaskExecutionContext":{"warningGroupId":0,"connectionParams":null,"udfFuncTenantCodeMap":null,"defaultFS":null},"dataxTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"dependenceTaskExecutionContext":null,"sqoopTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"varPool":null,"dryRun":0,"paramsMap":null,"procedureTaskExecutionContext":{"connectionParams":null}}'}
   [INFO] 2023-01-17 15:59:30.881 org.apache.dolphinscheduler.common.utils.FileUtils:[121] - create dir success /tmp/dolphinscheduler/exec/process/7960944102080/8256494422080_3/85/108
   [INFO] 2023-01-17 15:59:30.881 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[129] - task instance local execute path : /tmp/dolphinscheduler/exec/process/7960944102080/8256494422080_3/85/108
   [INFO] 2023-01-17 15:59:30.978 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[146] - the task begins to execute. task instance id: 108
   [INFO] 2023-01-17 15:59:31.029 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[83] - shell task params {"resourceList":[],"localParams":[],"rawScript":"sleep 120","dependence":{},"conditionResult":{"successNode":[],"failedNode":[]},"waitStartTimeout":{},"switchResult":{}}
   [INFO] 2023-01-17 15:59:31.117 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[137] - raw script : sleep 120
   [INFO] 2023-01-17 15:59:31.117 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[138] - task execute path : /tmp/dolphinscheduler/exec/process/7960944102080/8256494422080_3/85/108
   [INFO] 2023-01-17 15:59:31.125 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[86] - tenantCode user:molin, task dir:85_108
   [INFO] 2023-01-17 15:59:31.126 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[91] - create command file:/tmp/dolphinscheduler/exec/process/7960944102080/8256494422080_3/85/108/85_108.command
   [INFO] 2023-01-17 15:59:31.126 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[117] - command : #!/bin/sh
   BASEDIR=$(cd `dirname $0`; pwd)
   cd $BASEDIR
   source /etc/profile
   /tmp/dolphinscheduler/exec/process/7960944102080/8256494422080_3/85/108/85_108_node.sh
   [INFO] 2023-01-17 15:59:31.143 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[287] - task run command: sudo -u molin sh /tmp/dolphinscheduler/exec/process/7960944102080/8256494422080_3/85/108/85_108.command
   [INFO] 2023-01-17 15:59:31.146 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[178] - process start, process id is: 16894
   [INFO] 2023-01-17 15:59:32.146 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[66] -  -> welcome to use bigdata scheduling system...
   Disconnected from the target VM, address: '127.0.0.1:56983', transport: 'socket'
   [INFO] 2023-01-17 15:59:49.578 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[202] - process has exited, execute path:/tmp/dolphinscheduler/exec/process/7960944102080/8256494422080_3/85/108, processId:16894 ,exitStatusCode:130 ,processWaitForStatus:true ,processExitValue:130
   [INFO] 2023-01-17 15:59:49.600 org.apache.dolphinscheduler.server.worker.WorkerServer:[193] - worker server is stopping ..., cause : shutdownHook
   [INFO] 2023-01-17 15:59:49.657 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[213] - task instance id : 108,task final status : FAILURE
   [INFO] 2023-01-17 15:59:49.664 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[276] - develop mode is: false
   [INFO] 2023-01-17 15:59:49.673 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[294] - exec local path: /tmp/dolphinscheduler/exec/process/7960944102080/8256494422080_3/85/108 cleared.
   [INFO] 2023-01-17 15:59:49.775 org.quartz.core.QuartzScheduler:[585] - Scheduler quartzScheduler_$_NON_CLUSTERED paused.
   [INFO] 2023-01-17 15:59:49.785 org.springframework.scheduling.quartz.SchedulerFactoryBean:[845] - Shutting down Quartz Scheduler
   [INFO] 2023-01-17 15:59:49.790 org.quartz.core.QuartzScheduler:[666] - Scheduler quartzScheduler_$_NON_CLUSTERED shutting down.
   [INFO] 2023-01-17 15:59:49.791 org.quartz.core.QuartzScheduler:[585] - Scheduler quartzScheduler_$_NON_CLUSTERED paused.
   [INFO] 2023-01-17 15:59:49.793 org.quartz.core.QuartzScheduler:[740] - Scheduler quartzScheduler_$_NON_CLUSTERED shutdown complete.
   [INFO] 2023-01-17 15:59:49.812 com.zaxxer.hikari.HikariDataSource:[350] - DolphinScheduler - Shutdown initiated...
   [INFO] 2023-01-17 15:59:49.864 com.zaxxer.hikari.HikariDataSource:[352] - DolphinScheduler - Shutdown completed.
   
   Process finished with exit code 0
   
   ```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] KEN-LJQ commented on issue #13283: [Bug] [Master] Worker failover cause task repeat run

Posted by GitBox <gi...@apache.org>.
KEN-LJQ commented on issue #13283:
URL: https://github.com/apache/dolphinscheduler/issues/13283#issuecomment-1384912801

   @wcmolin Hi, I'm afraid that not the same question, and I tested branch `2.0.8-prepare` still has problem. Cause the root problem is using `start_time` to judge whether process should failover, although it had just failover


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org