Posted to dev@dolphinscheduler.apache.org by Jiajie Zhong <zh...@hotmail.com> on 2021/09/28 03:42:16 UTC

[PROPOSAL] Add Python API implementation of workflows-as-code

Hey guys,

    Apache DolphinScheduler is a good workflow scheduler: it's easy to extend,
distributed, and has a nice UI for creating and maintaining workflows. Right now workflows
can only be defined in the UI, which is easy to use and user friendly. That's good, but it could
be better if we added an extended API so that workflows could also be defined as code or as a YAML
file. Since YAML files are hard to maintain by hand, I think it is better to define workflows with code, aka workflows-as-code.

    When workflows are defined as code, it is easy to modify their configuration and to make
batch changes. It is also easier to define similar tasks with a loop statement,
and it gives us the ability to add unit tests for workflows. I hope Apache DolphinScheduler can
combine the benefits of defining workflows by code and by UI, so I am raising this proposal to add
workflows-as-code to Apache DolphinScheduler.
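
    To make the loop and unit-test point concrete, here is a minimal sketch. It deliberately
avoids any DolphinScheduler API: `build_tasks`, the table names, and the dict layout are
hypothetical stand-ins. It only shows that a code-defined workflow is ordinary data that a
loop can generate and a test can check.

```python
import unittest


def build_tasks(tables):
    """Return one hypothetical shell-task description per table name."""
    return [
        {"name": f"export_{table}", "command": f"echo export {table}"}
        for table in tables
    ]


class BuildTasksTest(unittest.TestCase):
    """Checks that run without any scheduler being available."""

    def test_one_task_per_table(self):
        tasks = build_tasks(["users", "orders", "payments"])
        self.assertEqual(
            [task["name"] for task in tasks],
            ["export_users", "export_orders", "export_payments"],
        )

    def test_task_names_are_unique(self):
        tasks = build_tasks(["users", "orders"])
        names = [task["name"] for task in tasks]
        self.assertEqual(len(names), len(set(names)))
```

    Saved as a module, the tests could be run with `python -m unittest`.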

    Actually, I have already started on this with a POC PR[1]. In this PR, I add a Python API that lets
users define workflows in Python code. The feature uses *Py4J* to connect Java and Python,
which means I do not add any new database model or infrastructure to Apache DolphinScheduler;
I just reuse the service layer in the dolphinscheduler-api package to create workflows. We can
consider the Python API as just another interface to Apache DolphinScheduler: like our UI, it
allows us to define and maintain workflows following the same rules.

    Here is a tutorial workflow defined with the Python API, which you can find in the PR file[2]:

```python
from pydolphinscheduler.core.process_definition import ProcessDefinition
from pydolphinscheduler.tasks.shell import Shell

with ProcessDefinition(name="tutorial") as pd:
    task_parent = Shell(name="task_parent", command="echo hello pydolphinscheduler")
    task_child_one = Shell(name="task_child_one", command="echo 'child one'")
    task_child_two = Shell(name="task_child_two", command="echo 'child two'")
    task_union = Shell(name="task_union", command="echo union")

    task_group = [task_child_one, task_child_two]
    task_parent.set_downstream(task_group)

    task_union << task_group

    pd.run()
```

    In the tutorial, we define a new ProcessDefinition named ‘tutorial’ using a Python context manager,
and then add four Shell tasks to it. In just five lines we create one process
definition with four tasks.
    Besides the process definition and the tasks, the other thing we have to
add to a workflow is the task dependencies. We add the functions `set_downstream` and `set_upstream`
to describe them. We also overload the bitwise shift operators to provide the shortcuts
`>>` and `<<`.
    Once the dependencies are set, the workflow definition is complete, but the whole definition
still lives on the Python API side. It is not persisted to the Apache DolphinScheduler database, and
Apache DolphinScheduler cannot run it, until we call `pd.submit()` or run it directly with `pd.run()`.
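
    The `>>` and `<<` shortcuts described above can be sketched with Python's `__rshift__` and
`__lshift__` hooks. This is only an illustration of the operator-overloading idea: the `Task`
class and `_as_list` helper below are hypothetical stand-ins, not the code in the PR.

```python
def _as_list(tasks):
    """Wrap a single task in a list so callers can pass either form."""
    return tasks if isinstance(tasks, list) else [tasks]


class Task:
    """Hypothetical stand-in for a task, not the pydolphinscheduler class."""

    def __init__(self, name):
        self.name = name
        self.upstream = set()    # names of tasks that must finish before this one
        self.downstream = set()  # names of tasks that run after this one

    def set_downstream(self, tasks):
        for task in _as_list(tasks):
            self.downstream.add(task.name)
            task.upstream.add(self.name)

    def set_upstream(self, tasks):
        for task in _as_list(tasks):
            task.set_downstream(self)

    def __rshift__(self, tasks):
        # `a >> b` is shorthand for `a.set_downstream(b)`.
        self.set_downstream(tasks)
        return tasks

    def __lshift__(self, tasks):
        # `a << b` is shorthand for `a.set_upstream(b)`.
        self.set_upstream(tasks)
        return tasks


# Rebuild the tutorial's dependencies with the stand-in class.
task_parent = Task("task_parent")
task_child_one = Task("task_child_one")
task_child_two = Task("task_child_two")
task_union = Task("task_union")

task_group = [task_child_one, task_child_two]
task_parent.set_downstream(task_group)
task_union << task_group

print(sorted(task_union.upstream))  # ['task_child_one', 'task_child_two']
```

    Returning the right-hand operand is a design choice in this sketch: it lets single tasks
be chained, as in `a >> b >> c`.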


[1]: https://github.com/apache/dolphinscheduler/pull/6269
[2]: https://github.com/apache/dolphinscheduler/pull/6269/files#diff-5561fec6b57cc611bee2b0d8f030965d76bdd202801d9f8a1e2e74c21769bc41


Best Wishes
— Jiajie




Re: Re: [PROPOSAL] Add Python API implementation of workflows-as-code

Posted by Jiajie Zhong <zh...@apache.org>.
Hi junfan,

>   1.  Could you please provide some Spark/Flink process examples?

In this POC, we only create an example for the Shell task, to make sure all the basic components work. We will support all the task types available in the DolphinScheduler UI later, and each of them will have examples, but sorry, that is out of scope for this PR. If you are interested in Spark/Flink tasks, please join us in developing them. I will list everything we have to do in the DSIP issue[1], and you can pick a feature you are interested in.

>   2.  I'm confused about workflow-as-code: do you mean it just defines the DAG and workflow parameters? Could we combine the workflow and user task code (like Spark/Flink programs)?

Sorry, I don't quite understand what you mean by `user task code`. Workflow-as-code just defines the workflow, including task type, name, parameters, dependencies, etc. It passes everything you would otherwise define in the UI to DolphinScheduler, and creates the ProcessDefinition, TaskDefinition, and so on.

But if you have another idea about it, please share it with me.

[1]: https://github.com/apache/dolphinscheduler/issues/6407

Best Wishes
― Jiajie

Re: [PROPOSAL] Add Python API implementation of workflows-as-code

Posted by zhang junfan <ju...@outlook.com>.
Good job, and thanks for focusing on multi-language support.

A few minor points for discussion:

  1.  Could you please provide some Spark/Flink process examples?
  2.  I'm confused about workflow-as-code: do you mean it just defines the DAG and workflow parameters? Could we combine the workflow and user task code (like Spark/Flink programs)?

Re: [PROPOSAL] Add Python API implementation of workflows-as-code

Posted by Jiajie Zhong <zh...@hotmail.com>.
I forgot to show the graph for what we define in the Python API. When we call `pd.run()`,
all the models are posted to the Apache DolphinScheduler server, which creates a DAG like the one below:

               --> task_child_one
             /                    \
task_parent -->                     --> task_union
             \                    /
               --> task_child_two

You can also find this graph in tutorial.py[1].

[1]: https://github.com/apache/dolphinscheduler/pull/6269/files#diff-5561fec6b57cc611bee2b0d8f030965d76bdd202801d9f8a1e2e74c21769bc41
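
To read the graph above as an execution order, here is a small sketch using `graphlib` from the
Python standard library (3.9+). It is only an illustration of what the dependencies imply and
says nothing about how the DolphinScheduler server actually schedules tasks; the `predecessors`
dict is hand-written from the diagram.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of tasks that must finish before it can start.
predecessors = {
    "task_parent": set(),
    "task_child_one": {"task_parent"},
    "task_child_two": {"task_parent"},
    "task_union": {"task_child_one", "task_child_two"},
}

sorter = TopologicalSorter(predecessors)
sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG

stages = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose predecessors are all done
    stages.append(ready)
    sorter.done(*ready)

print(stages)
# [['task_parent'], ['task_child_one', 'task_child_two'], ['task_union']]
```

Tasks that share a stage, here the two children, have no dependency on each other.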


Best Wishes
— Jiajie


> ```python
> from pydolphinscheduler.core.process_definition import ProcessDefinition
> from pydolphinscheduler.tasks.shell import Shell
> 
> with ProcessDefinition(name="tutorial") as pd:
>     task_parent = Shell(name="task_parent", command="echo hello pydolphinscheduler")
>     task_child_one = Shell(name="task_child_one", command="echo 'child one'")
>     task_child_two = Shell(name="task_child_two", command="echo 'child two'")
>     task_union = Shell(name="task_union", command="echo union")
> 
>     task_group = [task_child_one, task_child_two]
>     task_parent.set_downstream(task_group)
> 
>     task_union << task_group
> 
>     pd.run()
> ```


Re: [PROPOSAL] Add Python API implementation of workflows-as-code

Posted by Jiajie Zhong <zh...@hotmail.com>.
Hi guys,

I am here to share the progress of workflow-as-code. So far,
we can create a simple workflow with Shell tasks, but without setting
a schedule time for it. We can also set dependencies with the dependency operators
`>>` and `<<`, which are shortcuts for `set_downstream` and `set_upstream` respectively.

We have two examples in our examples directory[1]. One of them, named tutorial,
shows the basic concepts of workflow-as-code and how it
works. The other, named bulk_create, shows how to create workflows in batch
mode from a single declaration file. BTW, some people in our community use bulk_create
in their performance tests to create multiple workflows.

If you like this and want to use or contribute to workflow-as-code, you can go
and see its homepage[2]. It will guide you through installing and developing it.

We also have some interesting WIP PRs for workflow-as-code, such as adding schedule
time[3] and code-style and static checks[4].

Please join us and build workflow-as-code if you are interested in it.

[1]: https://github.com/apache/dolphinscheduler/tree/dev/dolphinscheduler-python/pydolphinscheduler/examples
[2]: https://github.com/apache/dolphinscheduler/tree/dev/dolphinscheduler-python/pydolphinscheduler
[3]: https://github.com/apache/dolphinscheduler/pull/6664
[4]: https://github.com/apache/dolphinscheduler/pull/6679


Best Wishes
— Jiajie

Re: [PROPOSAL] Add Python API implementation of workflows-as-code

Posted by Jiajie Zhong <zh...@hotmail.com>.
Hey guys,

  I am glad to let you know that PR[1] is close to being ready. This week I will add some
documentation for it, describing how it works and how to contribute to it.

Please join us if you are interested in this feature. We have lots of tasks
wanted in issue[2]; feel free to take one and work on it, and let me know on
this mailing list or in issue[2].

[1]: https://github.com/apache/dolphinscheduler/pull/6269
[2]: https://github.com/apache/dolphinscheduler/issues/6407

Best Wishes
— Jiajie


Re: [PROPOSAL] Add Python API implementation of workflows-as-code

Posted by David Dai <li...@apache.org>.
Great job.
I think it will be more convenient for the many users who often use code
for workflow orchestration.


Best Regards



---------------
Apache DolphinScheduler PMC Chair
David Dai
lidongdai@apache.org
Linkedin: https://www.linkedin.com/in/dailidong
Twitter: @WorkflowEasy
---------------


Re: [PROPOSAL] Add Python API implementation of workflows-as-code

Posted by CalvinKirs <ac...@163.com>.
+1
Sounds good.


Best Wishes!
CalvinKirs, Apache DolphinScheduler PMC




Re: [PROPOSAL] Add Python API implementation of workflows-as-code

Posted by wenjun <we...@apache.org>.
It's OK.

On Tue, Sep 28, 2021 at 2:31 PM Jiajie Zhong <zh...@hotmail.com> wrote:

>     Hi wenjun, thanks for your feedback, but I think it's better to keep it
> in the apache/dolphinscheduler repo.
> Because it depends on the dolphinscheduler-api package and calls Java service
> code, I think it is hard to separate into a different repo.
>
>    I took a quick look at other Apache projects that use Py4J to connect Java
> and Python, such as Flink and Spark, and they keep their Python API code in
> the same repo.
>
> Best Wish
> — Jiajie
>
>
> > This feature is more like a Python SDK; do we need to create a new repository
> > to maintain it?
>

Re: [PROPOSAL] Add Python API implementation of workflows-as-code

Posted by Jiajie Zhong <zh...@hotmail.com>.
    Hi wenjun, thanks for your feedback, but I think it's better to keep it in the apache/dolphinscheduler repo.
Because it depends on the dolphinscheduler-api package and calls Java service code, I think it is
hard to separate into a different repo.

   I took a quick look at other Apache projects that use Py4J to connect Java
and Python, such as Flink and Spark, and they keep their Python API code in the same repo.

Best Wishes
— Jiajie


> This feature is more like a Python SDK; do we need to create a new repository
> to maintain it?

Re: [PROPOSAL] Add Python API implementation of workflows-as-code

Posted by wenjun ruan <be...@gmail.com>.
+1
This feature is more like a Python SDK; do we need to create a new repository
to maintain it?
