Posted to dev@dolphinscheduler.apache.org by Mad <10...@qq.com> on 2020/07/01 05:01:38 UTC

Discussion of a scheme for passing task results/variables

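Technical proposal:

Core idea:

Pass task results/variables between task nodes by dynamically modifying the values of a small number of global variables.

Rationale:

1. Global variables are visible to every task node in the workflow.

2. Task instances execute serially, so the passing of task results and variables between task nodes never happens simultaneously, and the passed values can easily be mapped onto a limited number of pre-allocated global variables.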
 
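Implementation:

1. Traverse the task nodes once in order, with the global parameter count initialized to n = 0. On reaching an upstream node that passes parameters, set n += the number of parameters to pass; on reaching a downstream node that receives parameters, set n -= the number of parameters passed. Record the maximum value of n during the traversal and let N = n_max. Create N global variables, whose names can be reserved fields or random strings, for example {"G1","G2","G3",…,"GN"}, and add them to the set U.

2. Execute the task node task_A (nodes are executed serially) and find the downstream task node task_B to which it needs to pass parameters. If task_A does not need to pass any parameters, skip the following steps.

3. Map the M parameters that task_B needs to receive to M global variables taken out of the set U: {"Param1":"Gu+1", "Param2":"Gu+2", "Param3":"Gu+3", …, "ParamM":"Gu+M"}, with M <= N.

4. After task_A finishes, sync its parameters to the global variables according to the mapping built in step 3.

6. Before task_B executes, sync the global variables to its parameters according to the mapping built in step 3.

7. After task_B finishes, add the global variables that held its parameters back to the set U.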
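
Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DolphinScheduler code: the names `required_pool_size`, `make_pool`, `sends` and `receives` are hypothetical, and the sketch assumes that a node which both receives and sends parameters claims the variables for its outputs before the ones holding its inputs are released (following steps 3 and 7).

```python
from typing import Dict, List


def required_pool_size(order: List[str],
                       sends: Dict[str, int],
                       receives: Dict[str, int]) -> int:
    """Return N, the peak number of global variables in use (step 1).

    order    -- task nodes in execution order
    sends    -- node -> number of parameters it passes downstream
    receives -- node -> number of parameters it receives from upstream
    """
    n = 0
    n_max = 0
    for node in order:
        # Upstream role: the node's outputs occupy global variables.
        n += sends.get(node, 0)
        n_max = max(n_max, n)
        # Downstream role: once the node has run, its inputs are released.
        n -= receives.get(node, 0)
    return n_max


def make_pool(size: int) -> List[str]:
    """Create the set U of global variable names G1..GN."""
    return [f"G{i}" for i in range(1, size + 1)]


# A passes 2 parameters and B passes 3 parameters, all of them to C.
N = required_pool_size(["A", "B", "C"], {"A": 2, "B": 3}, {"C": 5})
U = make_pool(N)
```

For this workflow the traversal peaks after B, so N is 5 and U holds G1 through G5.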
 
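Example:

[image: execution flow]

The execution flow is shown in the figure above.

[image: parameter passing]

The parameter-passing process is shown in the figure above.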
 
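Node A passes two parameters, a and b, to node C.

Node B passes three parameters, c, d and e, to node C.

1. N = 5, U = {G1,G2,G3,G4,G5}.

2. Execute task node A and find that node C is its downstream node.

3. Obtain the mapping {a:G1, b:G2}; the remaining set is U = {G3,G4,G5}.

4. After A finishes, sync a and b to G1 and G2.

5. Execute task node B and find that node C is its downstream node.

6. Obtain the mapping {c:G3, d:G4, e:G5}; the remaining set is U = {}.

7. After B finishes, sync c, d and e to G3, G4 and G5.

8. Execute task node C, syncing G1, G2, G3, G4 and G5 to a, b, c, d and e; afterwards U = {G1,G2,G3,G4,G5}.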
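
The walkthrough above can be replayed with a small variable pool. The sketch below is a rough illustration only: the class `VariablePool` and its methods are hypothetical names, and the parameter values 1 to 5 are made up so that there is something to pass.

```python
from collections import deque


class VariablePool:
    """A pool of reusable global variables (the set U in the proposal)."""

    def __init__(self, size):
        self.free = deque(f"G{i}" for i in range(1, size + 1))
        self.values = {}

    def acquire(self, params):
        """Step 3: map each parameter to a free global variable."""
        if len(params) > len(self.free):
            raise RuntimeError("not enough free global variables")
        return {param: self.free.popleft() for param in params}

    def publish(self, mapping, outputs):
        """Step 4: copy the upstream task's parameters into the globals."""
        for param, slot in mapping.items():
            self.values[slot] = outputs[param]

    def fetch(self, mapping):
        """Step 6: copy the globals back into the downstream parameters."""
        return {param: self.values[slot] for param, slot in mapping.items()}

    def release(self, mapping):
        """Step 7: return the global variables to the pool."""
        for slot in mapping.values():
            self.values.pop(slot, None)
            self.free.append(slot)


pool = VariablePool(5)                                # 1. N = 5
map_a = pool.acquire(["a", "b"])                      # 3. {a:G1, b:G2}
pool.publish(map_a, {"a": 1, "b": 2})                 # 4. A finished
map_b = pool.acquire(["c", "d", "e"])                 # 6. {c:G3, d:G4, e:G5}
pool.publish(map_b, {"c": 3, "d": 4, "e": 5})         # 7. B finished
inputs_c = {**pool.fetch(map_a), **pool.fetch(map_b)}  # 8. C reads a..e
pool.release(map_a)
pool.release(map_b)                                   # U is full again
```

After the last line all five variables are back in the pool, which matches the final state U = {G1,G2,G3,G4,G5} in step 8.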

Re: Re: Discussion of a scheme for passing task results/variables

Posted by Mad <10...@qq.com>.
I think that's right. Let's just make it static, as you said.

Re: Re: Discussion of a scheme for passing task results/variables

Posted by wu shaoj <ga...@apache.org>.
I don't understand it clearly.
'The execution process of task instances is serial' means that task instances execute one by one, right? Please avoid ambiguous wording.
Every DAG should have one global variable set, let's say {key1,key2,key3,key4}, and tasks under the DAG can read and update one of them, right?
I suggest that the global variable set be static, not dynamic.
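
One way to picture the static alternative is a key set that is fixed when the DAG is defined, with tasks only reading and updating declared keys. The following is a minimal sketch of that idea; `StaticGlobals` is a hypothetical name and the value 42 is made up.

```python
class StaticGlobals:
    """A per-DAG set of global variables whose keys never change."""

    def __init__(self, keys):
        # The key set is decided once, when the DAG is defined.
        self._values = {key: None for key in keys}

    def read(self, key):
        # Raises KeyError if the key was never declared.
        return self._values[key]

    def update(self, key, value):
        if key not in self._values:
            raise KeyError(f"undeclared global variable: {key}")
        self._values[key] = value


dag_globals = StaticGlobals(["key1", "key2", "key3", "key4"])
dag_globals.update("key1", 42)   # an upstream task writes its result
result = dag_globals.read("key1")  # a downstream task reads it
```

Because no key is created or recycled at run time, there is no pool to size and no mapping to maintain.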



Re: Discussion of a scheme for passing task results/variables

Posted by Mad <10...@qq.com>.
This is the URL of the image in the project proposal: https://isrc.iscas.ac.cn/gitlab/summer2020/students/proj-2002015/-/wikis/pic


About question 1:
'The execution process of task instances is serial' means that variables from multiple nodes are never passed to the same variable of the same node. Although node A and node B can run at the same time, the passing can still be considered serial. For example, A and B both pass parameters to C after they execute. A passes its parameters to C through global variables after it finishes, and this process does not conflict with B's.
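
A small sketch may make the "no conflict" point concrete: A and B run in parallel threads, but each writes only to its own global variables, so neither can overwrite the other's values. The names `globals_store` and `run_task` and the values 1 to 5 are hypothetical. The mappings are assigned before the tasks start, and the sketch relies on CPython's thread-safe assignment of individual dict items.

```python
from concurrent.futures import ThreadPoolExecutor

globals_store = {}

# Mappings fixed before the tasks start; the slots are disjoint.
mapping_a = {"a": "G1", "b": "G2"}
mapping_b = {"c": "G3", "d": "G4", "e": "G5"}


def run_task(mapping, outputs):
    """Pretend to run a task, then sync its outputs to its own globals."""
    for param, slot in mapping.items():
        globals_store[slot] = outputs[param]


# A and B run at the same time.
with ThreadPoolExecutor(max_workers=2) as executor:
    future_a = executor.submit(run_task, mapping_a, {"a": 1, "b": 2})
    future_b = executor.submit(run_task, mapping_b, {"c": 3, "d": 4, "e": 5})
    future_a.result()
    future_b.result()

# C starts only after both upstream tasks have finished.
inputs_c = {
    param: globals_store[slot]
    for mapping in (mapping_a, mapping_b)
    for param, slot in mapping.items()
}
```

Whichever of A and B finishes first, C sees the same five values.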


About question 2:
I think you are right. If each parameter corresponds one-to-one to a global variable, the feature can be implemented easily, but global variables will be wasted (although they do not take up much memory). The method of determining the number of global variables is similar to a variable pool: it keeps the number of global variables to a minimum, and in some extreme cases it can save a lot of global variables.
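
To get a feel for the difference, the two counts can be compared on a long chain of tasks. The sketch below is an illustration under assumptions: `static_count` and `pooled_count` are hypothetical helpers, the ten-task chain is invented, and a task is assumed to claim variables for its outputs before the ones holding its inputs are released.

```python
def static_count(edges):
    """One global variable per passed parameter (one-to-one scheme)."""
    return sum(count for _, _, count in edges)


def pooled_count(order, edges):
    """Peak number of variables in use at once (variable-pool scheme)."""
    n = 0
    peak = 0
    for node in order:
        # Variables claimed for this node's outputs.
        n += sum(count for src, _, count in edges if src == node)
        peak = max(peak, n)
        # Variables released once this node has consumed its inputs.
        n -= sum(count for _, dst, count in edges if dst == node)
    return peak


# A chain T1 -> T2 -> ... -> T10, each task passing 2 parameters onward.
order = [f"T{i}" for i in range(1, 11)]
edges = [(order[i], order[i + 1], 2) for i in range(9)]

one_to_one = static_count(edges)       # 18 variables
pooled = pooled_count(order, edges)    # 4 variables
```

In the A, B, C example from the proposal both schemes need 5 variables, so the saving only shows up when variables can be recycled along the workflow.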

Re: Discussion of a scheme for passing task results/variables

Posted by wu shaoj <ga...@apache.org>.
The solution is so complex that it will have many limitations.
So the following questions come to mind:
1. What is the meaning of 'The execution process of task instances is serial'? I think it might be parallel.
2. Why should we know the number of global parameters? I think every task should only get its own variable value, calculate it, and put it into the global variable set. Nothing more needs to be done.



Re: Discussion of a scheme for passing task results/variables

Posted by lidong dai <da...@gmail.com>.
Hi,
The pic can't be seen. Please upload the pic to GitHub and post the
pic URL here. By the way, as this is a global project, English is needed. You can
add an English description using the Google/Baidu translation tools.



Best Regards
---------------
DolphinScheduler(Incubator) PPMC
Lidong Dai 代立冬
dailidong66@gmail.com
---------------



Re: Discussion of a scheme for passing task results/variables

Posted by wu shaoj <ga...@apache.org>.
Could you translate your solution into English?
