Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2021/08/28 17:54:51 UTC

[GitHub] [dolphinscheduler] reele opened a new issue #6055: [Bug][Fault-tolerant] Sub-process and Dependent tasks may fall into an endless-loop by recovery from kill

reele opened a new issue #6055:
URL: https://github.com/apache/dolphinscheduler/issues/6055


   **Describe the bug**
   When a stopping process instance is recovered, the sub-process task's state may be 'KILL' even though the sub-process instance has already been submitted by a RECOVER_TOLERANCE_FAULT_PROCESS command.
   In that case SubProcessTaskExecThread.waitTaskQuit() will [return directly](https://github.com/apache/dolphinscheduler/blob/e0eea995200f673d6406ec62c464c77f1d5b6171/dolphinscheduler-server/src/main/java/org/apache/dolphinscheduler/server/master/runner/SubProcessTaskExecThread.java#L128), and the task state is then [set to the sub-process's state](https://github.com/reele/dolphinscheduler/blob/3215cfb9f7c62bef7fa197b37ffc38cedd2c7ef5/dolphinscheduler-server/src/main/java/org/apache/dolphinscheduler/server/master/runner/SubProcessTaskExecThread.java#L66) (even if the sub-process is still running), so the sub-process task ends with an unfinished state
   and the parent thread MasterExecThread falls into an endless loop.
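
The sequence above can be sketched with a small stand-alone model. This is only an illustration under stated assumptions, not the DolphinScheduler code: the class `SubProcessLoopSketch`, the reduced `Status` enum, its `finished` flags, and both helper methods are hypothetical names introduced here.

```java
/** Simplified, hypothetical model of the reported flow; these are not the DolphinScheduler classes. */
final class SubProcessLoopSketch {

    /** Reduced stand-in for the execution status; the finished flags are an assumption. */
    enum Status {
        SUBMITTED_SUCCESS(false), READY_STOP(false), STOP(true), KILL(true);

        final boolean finished;

        Status(boolean finished) {
            this.finished = finished;
        }
    }

    /**
     * Returns the state the sub-process task ends up with.
     * subProcessNow is the sub-process state at submit time,
     * subProcessFinal is the state it reaches once it has really ended.
     */
    static Status runSubProcessTask(Status submitted, Status subProcessNow, Status subProcessFinal) {
        // A task that already looks finished skips the wait entirely.
        boolean waited = !submitted.finished;
        // Afterwards the task takes over whatever state the sub-process has at that moment.
        return waited ? subProcessFinal : subProcessNow;
    }

    /** Parent-side polling, bounded here so the sketch terminates; returns -1 if never finished. */
    static int pollsUntilFinished(Status taskState, int maxPolls) {
        for (int poll = 1; poll <= maxPolls; poll++) {
            if (taskState.finished) {
                return poll;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Reported case: task submitted as KILL while the sub-process is still READY_STOP.
        Status buggy = runSubProcessTask(Status.KILL, Status.READY_STOP, Status.STOP);
        System.out.println(buggy + " -> " + pollsUntilFinished(buggy, 5));

        // Task submitted as SUBMITTED_SUCCESS: the wait happens and a finished state is copied.
        Status waited = runSubProcessTask(Status.SUBMITTED_SUCCESS, Status.READY_STOP, Status.STOP);
        System.out.println(waited + " -> " + pollsUntilFinished(waited, 5));
    }
}
```

In the first case the task is left in READY_STOP, which this model treats as unfinished, so the bounded poll gives up; the real parent loop has no such bound.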
   
   
   **To Reproduce**
   This is a log example.
   In the beginning,
   process TRIGGER_D_DW_STS (id:3342, state:READY_STOP) has a sub-process task STS_D_T88 (id:62930, state:KILL), and
   sub-process task STS_D_T88 (id:62930) points to process STS_D_T88 (id:3375, state:READY_STOP).
   
   At 2021-07-31 19:19:55.010, the state of sub-process task STS_D_T88 changes from KILL to READY_STOP, and from then on the loop never ends.
   
   `[INFO] 2021-07-31 19:19:53.236 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[153] - start master exec thread , split DAG ...
   [INFO] 2021-07-31 19:19:53.792 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[145] - find one command: id: 9515, type: RECOVER_TOLERANCE_FAULT_PROCESS
   [INFO] 2021-07-31 19:19:53.809 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[242] - process 3337 start to complement 2021-07-30 00:00:00 data
   [INFO] 2021-07-31 19:19:53.844 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[315] - prepare process :3337 end
   [INFO] 2021-07-31 19:19:53.919 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[792] - add task to stand by list: TRIGGER_D_DW_STS
   [INFO] 2021-07-31 19:19:53.933 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[805] - remove task from stand by list: TRIGGER_D_DW_STS
   [INFO] 2021-07-31 19:19:53.945 org.apache.dolphinscheduler.service.process.ProcessService:[845] - start submit task : TRIGGER_D_DW_STS, instance id:3337, state: READY_STOP
   [INFO] 2021-07-31 19:19:53.950 org.apache.dolphinscheduler.service.process.ProcessService:[858] - end submit task to db successfully:TRIGGER_D_DW_STS state:KILL complete, instance id:3337 state: READY_STOP  
   [INFO] 2021-07-31 19:19:53.959 org.apache.dolphinscheduler.server.master.runner.SubProcessTaskExecThread:[121] - wait sub work flow: TRIGGER_D_DW_STS complete
   [INFO] 2021-07-31 19:19:53.959 org.apache.dolphinscheduler.server.master.runner.SubProcessTaskExecThread:[124] - sub work flow task TRIGGER_D_DW_STS already complete. task state:KILL, parent work flow instance state:READY_STOP
   [INFO] 2021-07-31 19:19:53.963 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[153] - start master exec thread , split DAG ...
   [INFO] 2021-07-31 19:19:53.969 org.apache.dolphinscheduler.server.master.runner.SubProcessTaskExecThread:[71] - subflow task :TRIGGER_D_DW_STS id:62897, process id:3337, exec thread completed 
   [INFO] 2021-07-31 19:19:53.975 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[145] - find one command: id: 9516, type: RECOVER_TOLERANCE_FAULT_PROCESS
   [INFO] 2021-07-31 19:19:53.989 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[315] - prepare process :3342 end
   [INFO] 2021-07-31 19:19:53.994 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[153] - start master exec thread , split DAG ...
   [INFO] 2021-07-31 19:19:54.001 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[792] - add task to stand by list: STS_D_T88
   [INFO] 2021-07-31 19:19:54.002 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[145] - find one command: id: 9517, type: RECOVER_TOLERANCE_FAULT_PROCESS
   [INFO] 2021-07-31 19:19:54.002 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[805] - remove task from stand by list: STS_D_T88
   [INFO] 2021-07-31 19:19:54.019 org.apache.dolphinscheduler.service.process.ProcessService:[845] - start submit task : STS_D_T88, instance id:3342, state: READY_STOP
   [INFO] 2021-07-31 19:19:54.023 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[315] - prepare process :3375 end
   [INFO] 2021-07-31 19:19:54.025 org.apache.dolphinscheduler.service.process.ProcessService:[858] - end submit task to db successfully:STS_D_T88 state:KILL complete, instance id:3342 state: READY_STOP  
   [INFO] 2021-07-31 19:19:54.030 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[153] - start master exec thread , split DAG ...
   [INFO] 2021-07-31 19:19:54.037 org.apache.dolphinscheduler.server.master.runner.SubProcessTaskExecThread:[121] - wait sub work flow: STS_D_T88 complete
   [INFO] 2021-07-31 19:19:54.038 org.apache.dolphinscheduler.server.master.runner.SubProcessTaskExecThread:[124] - sub work flow task STS_D_T88 already complete. task state:KILL, parent work flow instance state:READY_STOP
   [INFO] 2021-07-31 19:19:54.042 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[792] - add task to stand by list: CDB_T88_EMPLY_BIZ_STAT_SUM_CDM_1
   [INFO] 2021-07-31 19:19:54.044 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[805] - remove task from stand by list: CDB_T88_EMPLY_BIZ_STAT_SUM_CDM_1
   [INFO] 2021-07-31 19:19:54.053 org.apache.dolphinscheduler.server.master.runner.DependentTaskExecThread:[76] - dependent task start
   [INFO] 2021-07-31 19:19:54.058 org.apache.dolphinscheduler.service.process.ProcessService:[845] - start submit task : CDB_T88_EMPLY_BIZ_STAT_SUM_CDM_1, instance id:3375, state: READY_STOP
   [INFO] 2021-07-31 19:19:54.060 org.apache.dolphinscheduler.server.master.runner.SubProcessTaskExecThread:[71] - subflow task :STS_D_T88 id:62930, process id:3342, exec thread completed 
   [INFO] 2021-07-31 19:19:54.063 org.apache.dolphinscheduler.service.process.ProcessService:[858] - end submit task to db successfully:CDB_T88_EMPLY_BIZ_STAT_SUM_CDM_1 state:KILL complete, instance id:3375 state: READY_STOP  
   [INFO] 2021-07-31 19:19:54.063 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[315] - prepare process :3375 end
   [INFO] 2021-07-31 19:19:54.081 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[498] - task CDB_T88_EMPLY_BIZ_STAT_SUM_CDM_1 stopped, the state is KILL
   [INFO] 2021-07-31 19:19:54.091  - [taskAppId=TASK-7187-3375-63133]:[133] - wait depend task : CDB_T88_EMPLY_BIZ_STAT_SUM_CDM_1 complete
   [INFO] 2021-07-31 19:19:54.948 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :TRIGGER_D_DW_STS, id:62897 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:19:55.010 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :STS_D_T88, id:62930 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:19:55.053 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :CDB_T88_EMPLY_BIZ_STAT_SUM_CDM_1, id:63133 complete, state is KILL
   [ERROR] 2021-07-31 19:19:55.088 org.apache.dolphinscheduler.common.utils.DateUtils:[131] - error while parse date:null
   java.lang.NullPointerException: text
   	at java.util.Objects.requireNonNull(Objects.java:228)
   	at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1848)
   	at java.time.LocalDateTime.parse(LocalDateTime.java:492)
   	at org.apache.dolphinscheduler.common.utils.DateUtils.parse(DateUtils.java:128)
   	at org.apache.dolphinscheduler.common.utils.DateUtils.stringToDate(DateUtils.java:144)
   	at org.apache.dolphinscheduler.common.utils.DateUtils.getScheduleDate(DateUtils.java:240)
   	at org.apache.dolphinscheduler.server.master.runner.MasterExecThread.isComplementEnd(MasterExecThread.java:749)
   	at org.apache.dolphinscheduler.server.master.runner.MasterExecThread.getProcessInstanceState(MasterExecThread.java:695)
   	at org.apache.dolphinscheduler.server.master.runner.MasterExecThread.updateProcessInstanceState(MasterExecThread.java:762)
   	at org.apache.dolphinscheduler.server.master.runner.MasterExecThread.runProcess(MasterExecThread.java:922)
   	at org.apache.dolphinscheduler.server.master.runner.MasterExecThread.executeProcess(MasterExecThread.java:200)
   	at org.apache.dolphinscheduler.server.master.runner.MasterExecThread.run(MasterExecThread.java:181)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   	at java.lang.Thread.run(Thread.java:745)
   [INFO] 2021-07-31 19:19:55.088 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[764] - work flow process instance [id: 3375, name:STS_D_T88-1-1627719975621], state change from READY_STOP to STOP, cmd type: RECOVER_TOLERANCE_FAULT_PROCESS
   [INFO] 2021-07-31 19:19:55.102 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[925] - process:3375 end, state :STOP
   [INFO] 2021-07-31 19:19:55.959 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :TRIGGER_D_DW_STS, id:62897 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:19:56.017 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :STS_D_T88, id:62930 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:19:56.060 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[764] - work flow process instance [id: 3375, name:STS_D_T88-1-1627719975621], state change from READY_STOP to STOP, cmd type: RECOVER_TOLERANCE_FAULT_PROCESS
   [INFO] 2021-07-31 19:19:56.073 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[925] - process:3375 end, state :STOP
   [INFO] 2021-07-31 19:19:56.968 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :TRIGGER_D_DW_STS, id:62897 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:19:57.024 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :STS_D_T88, id:62930 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:19:57.978 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :TRIGGER_D_DW_STS, id:62897 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:19:58.032 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :STS_D_T88, id:62930 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:19:58.988 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :TRIGGER_D_DW_STS, id:62897 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:19:59.040 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :STS_D_T88, id:62930 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:19:59.998 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :TRIGGER_D_DW_STS, id:62897 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:20:00.047 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :STS_D_T88, id:62930 complete, state is READY_STOP 
   [INFO] 2021-07-31 19:20:01.007 org.apache.dolphinscheduler.server.master.runner.MasterExecThread:[864] - task :TRIGGER_D_DW_STS, id:62897 complete, state is READY_STOP 
   `
   
   **Expected behavior**
   I think sub-process and dependent tasks should always be submitted with the SUBMITTED_SUCCESS status, in [ProcessService.getSubmitTaskState](https://github.com/apache/dolphinscheduler/blob/e0eea995200f673d6406ec62c464c77f1d5b6171/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/process/ProcessService.java#L1280).
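
A minimal sketch of the proposed rule follows. It is not the actual `ProcessService.getSubmitTaskState` implementation or its signature: the class `SubmitStateSketch`, the `TaskKind` enum, and the method `submitState` are hypothetical names, and the "READY_STOP leads to KILL" branch is only inferred from the log above, where tasks are submitted with state KILL while the instance is READY_STOP.

```java
/** Hypothetical sketch of the proposed submit-state rule; not the real ProcessService code. */
final class SubmitStateSketch {

    /** Reduced stand-in for the execution status. */
    enum Status { SUBMITTED_SUCCESS, READY_STOP, KILL }

    /** Hypothetical task classification used only in this sketch. */
    enum TaskKind { SUB_PROCESS, DEPENDENT, OTHER }

    static Status submitState(TaskKind kind, Status processInstanceState) {
        // Proposed change: these task kinds are always submitted as SUBMITTED_SUCCESS,
        // so their exec threads wait for the real outcome instead of returning early.
        if (kind == TaskKind.SUB_PROCESS || kind == TaskKind.DEPENDENT) {
            return Status.SUBMITTED_SUCCESS;
        }
        // Behaviour inferred from the log: a stopping instance submits its tasks as KILL.
        if (processInstanceState == Status.READY_STOP) {
            return Status.KILL;
        }
        return Status.SUBMITTED_SUCCESS;
    }

    public static void main(String[] args) {
        System.out.println(submitState(TaskKind.SUB_PROCESS, Status.READY_STOP));
        System.out.println(submitState(TaskKind.OTHER, Status.READY_STOP));
    }
}
```

With this rule a sub-process or dependent task never starts in a state that already looks finished, so the early-return path described in the bug is not taken for them.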
   
   
   **Which version of Dolphin Scheduler:**
    -[any]
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [dolphinscheduler] reele removed a comment on issue #6055: [Bug][Fault-tolerant] Sub-process and Dependent tasks may fall into an endless-loop by recovery from kill

Posted by GitBox <gi...@apache.org>.
reele removed a comment on issue #6055:
URL: https://github.com/apache/dolphinscheduler/issues/6055#issuecomment-907669763


   [There is the PR](https://github.com/apache/dolphinscheduler/pull/6056)





[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #6055: [Bug][Fault-tolerant] Sub-process and Dependent tasks may fall into an endless-loop by recovery from kill

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #6055:
URL: https://github.com/apache/dolphinscheduler/issues/6055#issuecomment-907664267


   Hi:
   * Thank you for your feedback. We have received your issue; please wait patiently for a reply.
   * So that we can understand your request as soon as possible, please provide detailed information, the version, or pictures.
   * If you haven't received a reply for a long time, you can subscribe to the developer mailing list (subscription steps: https://dolphinscheduler.apache.org/zh-cn/community/development/subscribe.html), then write the issue URL in the email body and send your question to dev@dolphinscheduler.apache.org.





[GitHub] [dolphinscheduler] reele closed issue #6055: [Bug][Fault-tolerant&Pause] Sub-process and Dependent tasks may fall into an endless-loop by submitting with a finished state

Posted by GitBox <gi...@apache.org>.
reele closed issue #6055:
URL: https://github.com/apache/dolphinscheduler/issues/6055


   





[GitHub] [dolphinscheduler] reele commented on issue #6055: [Bug][Fault-tolerant] Sub-process and Dependent tasks may fall into an endless-loop by recovery from kill

Posted by GitBox <gi...@apache.org>.
reele commented on issue #6055:
URL: https://github.com/apache/dolphinscheduler/issues/6055#issuecomment-907669763


   [There is the PR](https://github.com/apache/dolphinscheduler/pull/6056)

