Posted to dev@dolphinscheduler.apache.org by Yichao Yang <10...@qq.com> on 2020/07/12 14:05:30 UTC

Re： [summer-2020]Plan Of Force-task-success 强制成功项目方案

Hi,


Would it be better to remove the precondition that the status of the task instance must be "Failed"?
Then, no matter what the current status of the instance is, the user could mark it as successful.


Best,
Yichao Yang




------------------ Original Message ------------------
From: "dev" <wenhemin@apache.org>;
Sent: Sunday, July 12, 2020, 8:46 AM
To: "dev" <dev@dolphinscheduler.apache.org>;

Subject: Re: [summer-2020]Plan Of Force-task-success 强制成功项目方案



Hi!

The workflow continues to execute after a forced success. Would it be better
to add an option for whether to continue execution?
Sometimes the purpose of a forced success is just to unblock the nodes that
depend on this task so that they can continue to execute.

As for whether we need to support continuing execution after the task has been
forced to succeed: I think there is no need; users can choose their own operation.

About the discussion section:
1. No need; this is determined by the user.
2. Refer to the reply above.

--------------------
DolphinScheduler(Incubator) Committer
Hemin Wen 温合民
wenhemin@apache.org
--------------------


Zhou Zheng <1606079777@qq.com> wrote on Sat, Jul 11, 2020 at 8:17 PM:

> Dear all,
> This is a new version of the Force-task-success plan.
>
> Demand analysis
>
> First, it is necessary to distinguish between forced success at the
> task instance level and forced success at the workflow level.
>
> Suppose there are three workflows, DAG1, DAG2, and DAG3, with the
> dependencies DAG1 -> DAG2 and DAG1 -> DAG3.
> DAG1 contains three nodes: t_a1 -> t_a2 -> t_a3.
> Suppose that in a particular run, the task instance of t_a2 fails:
>
> Task-instance-level forced success: after t_a2 is forced to succeed,
> t_a3 starts to execute. If t_a3 executes successfully, DAG2 and DAG3
> start to execute.
> Workflow-level forced success: after DAG1 is forced to succeed, the status
> of t_a2 is successful and t_a3 is not executed. DAG2 and DAG3 start to
> execute.
>
> What we will implement next is forced success at the task instance
> level. For a failed task instance, the user issues a force-success
> request for that specific instance, and the system then continues
> to execute the subsequent tasks that depend on it. These subsequent tasks
> may be only a part of the entire DAG workflow instance.
>
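
To make the difference between the two levels concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the workflow is a simple linear chain, and State, force_task_success, force_workflow_success, and the run_task callback are hypothetical names, not DolphinScheduler code.

```python
from enum import Enum


class State(Enum):
    NOT_RUN = "not_run"
    SUCCESS = "success"
    FAILURE = "failure"
    FORCED_SUCCESS = "forced_success"


DONE = (State.SUCCESS, State.FORCED_SUCCESS)


def force_task_success(states, order, task, run_task):
    """Task-instance level: force `task`, then run the nodes after it.

    Returns True if the whole workflow ends up successful, which is the
    condition for dependent workflows (DAG2, DAG3) to start.
    """
    if states[task] is not State.FAILURE:
        raise ValueError(f"{task} is not in the failed state")
    states[task] = State.FORCED_SUCCESS
    for name in order[order.index(task) + 1:]:
        states[name] = State.SUCCESS if run_task(name) else State.FAILURE
        if states[name] is State.FAILURE:
            break  # stop at the first new failure
    return all(states[name] in DONE for name in order)


def force_workflow_success(states):
    """Workflow level: failed nodes become successful, later nodes are not run."""
    for name, state in states.items():
        if state is State.FAILURE:
            states[name] = State.FORCED_SUCCESS
    return True  # dependent workflows may start right away
```

With t_a2 failed, the first function runs t_a3 before reporting the workflow as done, while the second leaves t_a3 untouched.
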
> Details
>
> 1. System-level requirements for forced success
>
> - Mark the corresponding failed task instance in the database and append
>   to its log; the task will not be re-executed.
> - Continue to run the subsequent tasks whose dependencies are satisfied,
>   following the normal logic and recording logs, and update the status and
>   run type of the corresponding workflow instance.
>
> 2. Run parameters used when continuing after the user chooses forced
> success
>
> - Parameters such as failure strategy, notification strategy, and priority
>   keep their original settings.
> - Considering that a large number of subsequent tasks may be executed at
>   the same time and consume excessive system resources, a parameter
>   needs to be added so that the user can select whether the subsequent
>   tasks are executed in parallel or serially.
>
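
One way to picture the parallel/serial parameter is as a choice of how the downstream tasks are grouped into batches. The sketch below uses Python's standard graphlib module (Python 3.9+); the function name execution_batches and the shape of the dependencies mapping are assumptions made for illustration, not part of the plan.

```python
from graphlib import TopologicalSorter


def execution_batches(dependencies, parallel):
    """Group downstream tasks into batches.

    `dependencies` maps each task name to the set of tasks it depends on.
    Tasks inside one batch have no dependencies on each other, so they
    could be started at the same time.
    """
    sorter = TopologicalSorter(dependencies)
    if not parallel:
        # Serial: one task per batch, in a valid dependency order.
        return [[task] for task in sorter.static_order()]
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

For downstream tasks t3 and t4 that both feed t5, the parallel mode yields two batches (t3 and t4 together, then t5), while the serial mode yields three single-task batches.
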
> 3. About sub_process
>
> For example, suppose there is a workflow DAG1 that contains three
> nodes t_a1 -> t_a2 -> t_a3, where t_a2 is of type sub_process
> and itself contains three nodes t_b1 -> t_b2 -> t_b3, and in this run
> t_b1 failed.
>
> - If the user chooses to force the success of t_b1, execution continues
>   with t_b2 -> t_b3; if both succeed, execution continues with t_a3.
> - If the user chooses to force the success of t_a2, execution continues
>   with t_a3.
>
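
The two sub_process cases can be sketched as follows. This is a toy model that hard-codes the example above and assumes every later task succeeds; PARENT, SUB_PROCESS, and tasks_to_run_after_force are hypothetical names.

```python
# The example from the plan: t_a2 is a sub_process containing t_b1..t_b3.
PARENT = ["t_a1", "t_a2", "t_a3"]
SUB_PROCESS = {"t_a2": ["t_b1", "t_b2", "t_b3"]}


def tasks_to_run_after_force(forced):
    """Return the tasks that run, in order, if every later task succeeds."""
    if forced in PARENT:
        # Forcing a parent-level node skips whatever is inside it.
        return PARENT[PARENT.index(forced) + 1:]
    for parent_node, children in SUB_PROCESS.items():
        if forced in children:
            rest_of_sub = children[children.index(forced) + 1:]
            rest_of_parent = PARENT[PARENT.index(parent_node) + 1:]
            return rest_of_sub + rest_of_parent
    raise KeyError(forced)
```

Forcing t_b1 gives t_b2, t_b3, t_a3; forcing t_a2 gives only t_a3.
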
> 4. About the case where the workflow DAG is modified after forced
> success is requested
>
> Because resources (CPU and memory) are limited, some commands may be
> backlogged when the load is relatively high, or a failure of the master
> node may trigger fault tolerance. In that case the user may modify the
> corresponding workflow definition or workflow instance before the
> force-success command has been executed. Because we only execute the
> subsequent tasks in the workflow instance, modifying the
> ProcessDefinition is fine. If the ProcessInstance is modified, execution
> can only follow the modified content.
>
> Expected behavior
> From the perspective of the interaction between user and system, a
> complete forced-success process looks like this:
>
> Triggering conditions
> The user requests a forced-success operation on a failed task
> instance.
>
> Preconditions
> 1. The status of the task instance is "Failed".
>
> Post-conditions
> 1. The system updates the status of the task instance and appends to its
>    log.
> 2. The system continues to execute the tasks after this node whose
>    dependencies are satisfied.
> 3. The system updates the status and run type of the corresponding
>    workflow instance.
>
> Normal process
> 1. The user requests forced success.
> 2. The system prompts the user to select "parallel" or "serial"
>    execution.
> 3. The user fills in the parameters and confirms.
> 4. The system adds a new command and waits for it to be executed.
>
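
The precondition and the normal process could look roughly like this on the request side. It is a sketch only: ForceSuccessCommand, request_force_success, and the plain list standing in for the command table are assumptions, not the project's actual API.

```python
from dataclasses import dataclass


@dataclass
class ForceSuccessCommand:
    task_instance_id: int
    execution_mode: str  # "parallel" or "serial"


def request_force_success(task_state, task_instance_id, execution_mode, commands):
    """Check the precondition, then queue a command for later execution."""
    if task_state != "failed":
        raise ValueError("only a failed task instance can be forced to succeed")
    if execution_mode not in ("parallel", "serial"):
        raise ValueError("execution_mode must be 'parallel' or 'serial'")
    command = ForceSuccessCommand(task_instance_id, execution_mode)
    commands.append(command)  # step 4: the command now waits to be executed
    return command
```

Keeping the status check in one place would also make it easy to relax the precondition later if that turns out to be useful.
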
> Extended process
> 4a-1. The user requests the task instance list.
> 4a-2. (If the command has finished executing) The status of the instance
>       in the list returned by the system is "forced success".
> 4a-3. (If the command has not finished executing) The status of the
>       instance in the list returned by the system is still "Failed".
> 4b-1. The user requests the workflow instance list.
> 4b-2. The run type of the corresponding instance in the list returned
>       by the system is "start execution from the forced success node",
>       and its status is consistent with the statuses (such as running,
>       paused, failed, successful, etc.) used in the normal execution flow
>       of a command.
>
> Special needs
> See the "Details" section above.
>
> Solution
> In short, we follow the project's existing execution approach: the
> api-server inserts a command, the master parses and monitors it and hands
> the work over to the worker, which updates the status of the specified
> task instance and executes the subsequent tasks whose dependencies are
> satisfied.
>
> For the data layer:
>
> - A new enumeration value needs to be added to the command_type field in
>   the t_ds_command table to mark a command as a force-success command.
> - A new enumeration value should also be added to the state field in the
>   t_ds_task_instance table to mark that the task instance was forced to
>   succeed.
> - An enumeration value is added to the command_type field in the
>   t_ds_process_instance table to indicate that execution started
>   from the forced-success node.
>
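
As a rough picture of the data-layer change, the new enumeration values might be modeled like this. All member names and numeric values below are placeholders chosen for the example; the real enumerations in the project have their own members and codes, and the sketch assumes one new command type is used in both tables.

```python
from enum import IntEnum


class CommandType(IntEnum):
    START_PROCESS = 0          # stand-in for an existing value
    FORCE_TASK_SUCCESS = 100   # hypothetical new value


class TaskState(IntEnum):
    FAILURE = 0                # stand-in for an existing value
    SUCCESS = 1                # stand-in for an existing value
    FORCED_SUCCESS = 100       # hypothetical new value


def dependency_satisfied(state):
    """A downstream task may start if its upstream succeeded or was forced."""
    return state in (TaskState.SUCCESS, TaskState.FORCED_SUCCESS)
```

The helper shows why a separate state is still compatible with the dependency check: forced success is recorded distinctly but treated like success when deciding what runs next.
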
> For api-server:
>
> - Add a new interface that the front end uses to trigger a forced-success
>   operation. The api-server will insert a command into the Command table
>   of the database.
>
> For master-server:
>
> - After detecting the newly inserted command in the database, the master
>   parses and splits the DAG and passes the tasks to the worker through
>   netty.
>
> For worker-server:
>
> - The worker needs to perform the forced-success processing on the
>   specified node, that is, update the status of the task_instance in the
>   database and append the forced-success part to the existing log. Other
>   nodes simply run normally according to the project's existing logic.
>
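
The hand-off between the three servers can be sketched with in-memory stand-ins. Nothing here is real DolphinScheduler code: the deque plays the role of the Command table, the dict plays the role of t_ds_task_instance, and a direct function call replaces the netty transport.

```python
from collections import deque

# In-memory stand-ins for the Command table and the task instance table.
command_table = deque()
task_instances = {7: {"state": "failed", "log": ["task exited with an error"]}}


def api_insert_command(task_instance_id):
    """api-server: record the user's request as a command."""
    command_table.append(
        {"type": "force_task_success", "task_instance_id": task_instance_id}
    )


def worker_force_success(task_instance_id):
    """worker: update the state and append to the existing log, no re-run."""
    instance = task_instances[task_instance_id]
    instance["state"] = "forced_success"
    instance["log"].append("forced success requested by user")


def master_poll_once():
    """master: take one pending command and dispatch it; False if none."""
    if not command_table:
        return False
    command = command_table.popleft()
    if command["type"] == "force_task_success":
        worker_force_success(command["task_instance_id"])
    return True
```

Until the master picks up the command, the task instance still shows as failed, which matches steps 4a-2 and 4a-3 above.
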
> ------
>
> Hello everyone, this is the new project plan for force success (backend).
> (The original was written in Markdown; for easier reading, you can visit:
> https://isrc.iscas.ac.cn/gitlab/summer2020/students/proj-2002010/blob/master/docs/%E5%91%A8%E6%8A%A57.5~7.11.md
> )

Re: Re： [summer-2020]Plan Of Force-task-success 强制成功项目方案

Posted by Zhou Zheng <16...@qq.com>.

I think it is a good idea.
Since the requirements in Summer-2020 are for failed tasks, I didn't think that far ahead.
Maybe it will be added in later iterations.



