Posted to user@beam.apache.org by data_nerd_666 <da...@gmail.com> on 2023/12/14 12:17:39 UTC

Can apache beam be used for control flow (ETL workflow)

Hi all,

I am new to Apache Beam and am very excited to find Beam in the Apache
community. I see lots of use cases for Apache Beam in data flow
(processing large amounts of batch/streaming data). I am wondering whether
I can use Apache Beam for control flow (ETL workflow). I don't mean the
Spark/Flink jobs in the ETL workflow; I mean the ETL workflow itself.
An ETL workflow is also a DAG, which is very similar to Beam's
abstraction, but unfortunately I didn't find such use cases on the internet.
So I'd like to ask the Beam community whether I can
use Apache Beam for control flow (ETL workflow). If yes, please point me to
some success stories. Thanks

Re: Can apache beam be used for control flow (ETL workflow)

Posted by Mikhail Khludnev <mk...@apache.org>.
Also, Apache Hop provides some sort of integration with Beam. Hop is
divided into Workflows (what you are asking about) and Pipelines (similar to
Beam). As far as I understand (!), Hop's workflows are not persistent, i.e.
they can't recover from a node failure the way Airflow can.

On Thu, Dec 14, 2023 at 3:18 PM data_nerd_666 <da...@gmail.com> wrote:

> Hi all,
>
> I am new to apache beam, and am very excited to find beam in apache
> community. I see lots of use cases of using apache beam for data flow
> (process large amount of batch/streaming data). I am just wondering whether
> I can use apache beam for control flow (ETL workflow). I don't mean the
> spark/flink job in the ETL workflow, I mean the ETL workflow itself.
> Because ETL workflow is also a DAG which is very similar as the abstraction
> of apache beam, but unfortunately I didn't find such use cases on internet.
> So I'd like to ask this question in beam community to confirm whether I can
> use apache beam for control flow (ETL workflow). If yes, please let me know
> some success stories of this. Thanks
>
>
>
>

-- 
Sincerely yours
Mikhail Khludnev

Re: Can apache beam be used for control flow (ETL workflow)

Posted by Steve973 <st...@gmail.com>.
Wouldn't Apache Camel be more appropriate for the orchestration aspect,
delegating to Beam for the processing?

On Sun, Dec 24, 2023, 8:47 AM data_nerd_666 <da...@gmail.com> wrote:

> Thanks Austin & Chad, but my use case is to use beam to do ETL workflow
> control, which seems different from your case. I would like to check
> whether anyone has used beam for this kind of use case and whether beam is
> a good choice.
>
> On Sat, Dec 23, 2023 at 12:58 AM Chad Dombrova <ch...@gmail.com> wrote:
>
>> Hi,
>> I'm the guy who gave the Movie Magic talk.  Since it's possible to write
>> stateful transforms with Beam, it is capable of some very sophisticated
>> flow control.   I've not seen a python framework that combines this with
>> streaming data nearly as well.  That said, there aren't a lot of great
>> working examples out there for transforms that do sophisticated flow
>> control, and I feel like we're always wrestling with differences in
>> behavior between the direct runner and Dataflow.  There was a thread about
>> polling patterns [1] on this list that never really got a satisfying
>> resolution.  Likewise, there was a thread about using an SDF with an
>> unbound source [2] that also didn't get fully resolved.
>>
>> [1] https://lists.apache.org/thread/nsxs49vjokcc5wkvdvbvsqwzq682s7qw
>> [2] https://lists.apache.org/thread/n3xgml0z8fok7101q79rsmdgp06lofnb
>>
>>
>>
>> On Sun, Dec 17, 2023 at 3:53 PM Austin Bennett <au...@apache.org> wrote:
>>
>>> https://beamsummit.org/sessions/event-driven-movie-magic/
>>>
>>> ^^ the question made me think of that use case.  Though, unclear how
>>> close it is to what you're thinking about.
>>>
>>> Cheers -
>>>
>>> On Fri, Dec 15, 2023 at 7:01 AM Byron Ellis via user <
>>> user@beam.apache.org> wrote:
>>>
>>>> As Jan says, theoretically possible? Sure. That particular set of
>>>> operations? Overkill. If you don't have it already set up I'd say even
>>>> something like Airflow is overkill here. If all you need to do is "launch
>>>> job and wait" when a file arrives... that's a small script and not
>>>> something that particularly requires a distributed data processing system.
>>>>
>>>> On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Apache Beam describes itself as "Apache Beam is an open-source,
>>>>> unified programming model for batch and streaming data processing
>>>>> pipelines, ...". As such, it is possible to use it to express essentially
>>>>> arbitrary logic and run it as a streaming pipeline. A streaming pipeline
>>>>> processes input data and produces output data and/or actions. Given these
>>>>> assumptions, it is technically feasible to use Apache Beam for
>>>>> orchestrating other workflows, the problem is that it will very much likely
>>>>> not be efficient. Apache Beam has a lot of heavy-lifting related to the
>>>>> fact it is designed to process large volumes of data in a scalable way,
>>>>> which is probably not what would one need for workflow orchestration. So,
>>>>> my two cents would be, that although it _could_ be done, it probably
>>>>> _should not_ be done.
>>>>>
>>>>> Best,
>>>>>
>>>>>  Jan
>>>>> On 12/15/23 13:39, Mikhail Khludnev wrote:
>>>>>
>>>>> Hello,
>>>>> I think this page
>>>>> https://beam.apache.org/documentation/ml/orchestration/ might answer
>>>>> your question.
>>>>> Frankly speaking: GCP Workflows and Apache Airflow.
>>>>> But Beam itself is a data-stream/flow or batch processor; not a
>>>>> workflow engine (IMHO).
>>>>>
>>>>> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I know it is technically possible, but my case may be a little
>>>>>> special. Say I have 3 steps for my control flow (ETL workflow):
>>>>>> Step 1. upstream file watching
>>>>>> Step 2. call some external service to run one job, e.g. run a
>>>>>> notebook, run a python script
>>>>>> Step 3. notify downstream workflow
>>>>>> Can I use apache beam to build a DAG with 3 nodes and run this as
>>>>>> either flink or spark job.  It might be a little weird, but I just want to
>>>>>> learn from the community whether this is the right way to use apache beam,
>>>>>> and has anyone done this before? Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <
>>>>>> user@beam.apache.org> wrote:
>>>>>>
>>>>>>> It’s technically possible but the closest thing I can think of would
>>>>>>> be triggering things based on things like file watching.
>>>>>>>
>>>>>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Not using beam as time-based scheduler, but just use it to control
>>>>>>>> execution orders of ETL workflow DAG, because beam's abstraction is also a
>>>>>>>> DAG.
>>>>>>>> I know it is a little weird, just want to confirm with the
>>>>>>>> community, has anyone used beam like this before?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský <je...@seznam.cz>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> can you give an example of what you mean for better understanding?
>>>>>>>>> Do
>>>>>>>>> you mean using Beam as a scheduler of other ETL workflows?
>>>>>>>>>
>>>>>>>>>   Jan
>>>>>>>>>
>>>>>>>>> On 12/14/23 13:17, data_nerd_666 wrote:
>>>>>>>>> > Hi all,
>>>>>>>>> >
>>>>>>>>> > I am new to apache beam, and am very excited to find beam in
>>>>>>>>> apache
>>>>>>>>> > community. I see lots of use cases of using apache beam for data
>>>>>>>>> flow
>>>>>>>>> > (process large amount of batch/streaming data). I am just
>>>>>>>>> wondering
>>>>>>>>> > whether I can use apache beam for control flow (ETL workflow). I
>>>>>>>>> don't
>>>>>>>>> > mean the spark/flink job in the ETL workflow, I mean the ETL
>>>>>>>>> workflow
>>>>>>>>> > itself. Because ETL workflow is also a DAG which is very similar
>>>>>>>>> as
>>>>>>>>> > the abstraction of apache beam, but unfortunately I didn't find
>>>>>>>>> such
>>>>>>>>> > use cases on internet. So I'd like to ask this question in beam
>>>>>>>>> > community to confirm whether I can use apache beam for control
>>>>>>>>> flow
>>>>>>>>> > (ETL workflow). If yes, please let me know some success stories
>>>>>>>>> of
>>>>>>>>> > this. Thanks
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>> Sincerely yours
>>>>> Mikhail Khludnev
>>>>>
>>>>>

Re: Can apache beam be used for control flow (ETL workflow)

Posted by data_nerd_666 <da...@gmail.com>.
Thanks Austin & Chad, but my use case is to use Beam for ETL workflow
control, which seems different from your case. I would like to check
whether anyone has used Beam for this kind of use case and whether Beam is
a good choice.

On Sat, Dec 23, 2023 at 12:58 AM Chad Dombrova <ch...@gmail.com> wrote:

> Hi,
> I'm the guy who gave the Movie Magic talk.  Since it's possible to write
> stateful transforms with Beam, it is capable of some very sophisticated
> flow control.   I've not seen a python framework that combines this with
> streaming data nearly as well.  That said, there aren't a lot of great
> working examples out there for transforms that do sophisticated flow
> control, and I feel like we're always wrestling with differences in
> behavior between the direct runner and Dataflow.  There was a thread about
> polling patterns [1] on this list that never really got a satisfying
> resolution.  Likewise, there was a thread about using an SDF with an
> unbound source [2] that also didn't get fully resolved.
>
> [1] https://lists.apache.org/thread/nsxs49vjokcc5wkvdvbvsqwzq682s7qw
> [2] https://lists.apache.org/thread/n3xgml0z8fok7101q79rsmdgp06lofnb
>
>
>
> On Sun, Dec 17, 2023 at 3:53 PM Austin Bennett <au...@apache.org> wrote:
>
>> https://beamsummit.org/sessions/event-driven-movie-magic/
>>
>> ^^ the question made me think of that use case.  Though, unclear how
>> close it is to what you're thinking about.
>>
>> Cheers -
>>
>> On Fri, Dec 15, 2023 at 7:01 AM Byron Ellis via user <
>> user@beam.apache.org> wrote:
>>
>>> As Jan says, theoretically possible? Sure. That particular set of
>>> operations? Overkill. If you don't have it already set up I'd say even
>>> something like Airflow is overkill here. If all you need to do is "launch
>>> job and wait" when a file arrives... that's a small script and not
>>> something that particularly requires a distributed data processing system.
>>>
>>> On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>
>>>> Hi,
>>>>
>>>> Apache Beam describes itself as "Apache Beam is an open-source, unified
>>>> programming model for batch and streaming data processing pipelines, ...".
>>>> As such, it is possible to use it to express essentially arbitrary logic
>>>> and run it as a streaming pipeline. A streaming pipeline processes input
>>>> data and produces output data and/or actions. Given these assumptions, it
>>>> is technically feasible to use Apache Beam for orchestrating other
>>>> workflows, the problem is that it will very much likely not be efficient.
>>>> Apache Beam has a lot of heavy-lifting related to the fact it is designed
>>>> to process large volumes of data in a scalable way, which is probably not
>>>> what would one need for workflow orchestration. So, my two cents would be,
>>>> that although it _could_ be done, it probably _should not_ be done.
>>>>
>>>> Best,
>>>>
>>>>  Jan
>>>> On 12/15/23 13:39, Mikhail Khludnev wrote:
>>>>
>>>> Hello,
>>>> I think this page
>>>> https://beam.apache.org/documentation/ml/orchestration/ might answer
>>>> your question.
>>>> Frankly speaking: GCP Workflows and Apache Airflow.
>>>> But Beam itself is a data-stream/flow or batch processor; not a
>>>> workflow engine (IMHO).
>>>>
>>>> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> I know it is technically possible, but my case may be a little
>>>>> special. Say I have 3 steps for my control flow (ETL workflow):
>>>>> Step 1. upstream file watching
>>>>> Step 2. call some external service to run one job, e.g. run a
>>>>> notebook, run a python script
>>>>> Step 3. notify downstream workflow
>>>>> Can I use apache beam to build a DAG with 3 nodes and run this as
>>>>> either flink or spark job.  It might be a little weird, but I just want to
>>>>> learn from the community whether this is the right way to use apache beam,
>>>>> and has anyone done this before? Thanks
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <
>>>>> user@beam.apache.org> wrote:
>>>>>
>>>>>> It’s technically possible but the closest thing I can think of would
>>>>>> be triggering things based on things like file watching.
>>>>>>
>>>>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Not using beam as time-based scheduler, but just use it to control
>>>>>>> execution orders of ETL workflow DAG, because beam's abstraction is also a
>>>>>>> DAG.
>>>>>>> I know it is a little weird, just want to confirm with the
>>>>>>> community, has anyone used beam like this before?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský <je...@seznam.cz>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> can you give an example of what you mean for better understanding?
>>>>>>>> Do
>>>>>>>> you mean using Beam as a scheduler of other ETL workflows?
>>>>>>>>
>>>>>>>>   Jan
>>>>>>>>
>>>>>>>> On 12/14/23 13:17, data_nerd_666 wrote:
>>>>>>>> > Hi all,
>>>>>>>> >
>>>>>>>> > I am new to apache beam, and am very excited to find beam in
>>>>>>>> apache
>>>>>>>> > community. I see lots of use cases of using apache beam for data
>>>>>>>> flow
>>>>>>>> > (process large amount of batch/streaming data). I am just
>>>>>>>> wondering
>>>>>>>> > whether I can use apache beam for control flow (ETL workflow). I
>>>>>>>> don't
>>>>>>>> > mean the spark/flink job in the ETL workflow, I mean the ETL
>>>>>>>> workflow
>>>>>>>> > itself. Because ETL workflow is also a DAG which is very similar
>>>>>>>> as
>>>>>>>> > the abstraction of apache beam, but unfortunately I didn't find
>>>>>>>> such
>>>>>>>> > use cases on internet. So I'd like to ask this question in beam
>>>>>>>> > community to confirm whether I can use apache beam for control
>>>>>>>> flow
>>>>>>>> > (ETL workflow). If yes, please let me know some success stories
>>>>>>>> of
>>>>>>>> > this. Thanks
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>>
>>>>

Re: Can apache beam be used for control flow (ETL workflow)

Posted by Chad Dombrova <ch...@gmail.com>.
Hi,
I'm the guy who gave the Movie Magic talk. Since it's possible to write
stateful transforms with Beam, it is capable of some very sophisticated
flow control. I've not seen a Python framework that combines this with
streaming data nearly as well. That said, there aren't a lot of great
working examples out there for transforms that do sophisticated flow
control, and I feel like we're always wrestling with differences in
behavior between the direct runner and Dataflow. There was a thread about
polling patterns [1] on this list that never really got a satisfying
resolution. Likewise, there was a thread about using an SDF with an
unbounded source [2] that also didn't get fully resolved.

[1] https://lists.apache.org/thread/nsxs49vjokcc5wkvdvbvsqwzq682s7qw
[2] https://lists.apache.org/thread/n3xgml0z8fok7101q79rsmdgp06lofnb



On Sun, Dec 17, 2023 at 3:53 PM Austin Bennett <au...@apache.org> wrote:

> https://beamsummit.org/sessions/event-driven-movie-magic/
>
> ^^ the question made me think of that use case.  Though, unclear how close
> it is to what you're thinking about.
>
> Cheers -
>
> On Fri, Dec 15, 2023 at 7:01 AM Byron Ellis via user <us...@beam.apache.org>
> wrote:
>
>> As Jan says, theoretically possible? Sure. That particular set of
>> operations? Overkill. If you don't have it already set up I'd say even
>> something like Airflow is overkill here. If all you need to do is "launch
>> job and wait" when a file arrives... that's a small script and not
>> something that particularly requires a distributed data processing system.
>>
>> On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi,
>>>
>>> Apache Beam describes itself as "Apache Beam is an open-source, unified
>>> programming model for batch and streaming data processing pipelines, ...".
>>> As such, it is possible to use it to express essentially arbitrary logic
>>> and run it as a streaming pipeline. A streaming pipeline processes input
>>> data and produces output data and/or actions. Given these assumptions, it
>>> is technically feasible to use Apache Beam for orchestrating other
>>> workflows, the problem is that it will very much likely not be efficient.
>>> Apache Beam has a lot of heavy-lifting related to the fact it is designed
>>> to process large volumes of data in a scalable way, which is probably not
>>> what would one need for workflow orchestration. So, my two cents would be,
>>> that although it _could_ be done, it probably _should not_ be done.
>>>
>>> Best,
>>>
>>>  Jan
>>> On 12/15/23 13:39, Mikhail Khludnev wrote:
>>>
>>> Hello,
>>> I think this page
>>> https://beam.apache.org/documentation/ml/orchestration/ might answer
>>> your question.
>>> Frankly speaking: GCP Workflows and Apache Airflow.
>>> But Beam itself is a data-stream/flow or batch processor; not a workflow
>>> engine (IMHO).
>>>
>>> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <da...@gmail.com>
>>> wrote:
>>>
>>>> I know it is technically possible, but my case may be a little special.
>>>> Say I have 3 steps for my control flow (ETL workflow):
>>>> Step 1. upstream file watching
>>>> Step 2. call some external service to run one job, e.g. run a notebook,
>>>> run a python script
>>>> Step 3. notify downstream workflow
>>>> Can I use apache beam to build a DAG with 3 nodes and run this as
>>>> either flink or spark job.  It might be a little weird, but I just want to
>>>> learn from the community whether this is the right way to use apache beam,
>>>> and has anyone done this before? Thanks
>>>>
>>>>
>>>>
>>>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <
>>>> user@beam.apache.org> wrote:
>>>>
>>>>> It’s technically possible but the closest thing I can think of would
>>>>> be triggering things based on things like file watching.
>>>>>
>>>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Not using beam as time-based scheduler, but just use it to control
>>>>>> execution orders of ETL workflow DAG, because beam's abstraction is also a
>>>>>> DAG.
>>>>>> I know it is a little weird, just want to confirm with the community,
>>>>>> has anyone used beam like this before?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský <je...@seznam.cz>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> can you give an example of what you mean for better understanding?
>>>>>>> Do
>>>>>>> you mean using Beam as a scheduler of other ETL workflows?
>>>>>>>
>>>>>>>   Jan
>>>>>>>
>>>>>>> On 12/14/23 13:17, data_nerd_666 wrote:
>>>>>>> > Hi all,
>>>>>>> >
>>>>>>> > I am new to apache beam, and am very excited to find beam in
>>>>>>> apache
>>>>>>> > community. I see lots of use cases of using apache beam for data
>>>>>>> flow
>>>>>>> > (process large amount of batch/streaming data). I am just
>>>>>>> wondering
>>>>>>> > whether I can use apache beam for control flow (ETL workflow). I
>>>>>>> don't
>>>>>>> > mean the spark/flink job in the ETL workflow, I mean the ETL
>>>>>>> workflow
>>>>>>> > itself. Because ETL workflow is also a DAG which is very similar
>>>>>>> as
>>>>>>> > the abstraction of apache beam, but unfortunately I didn't find
>>>>>>> such
>>>>>>> > use cases on internet. So I'd like to ask this question in beam
>>>>>>> > community to confirm whether I can use apache beam for control
>>>>>>> flow
>>>>>>> > (ETL workflow). If yes, please let me know some success stories of
>>>>>>> > this. Thanks
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>>
>>>

Re: Can apache beam be used for control flow (ETL workflow)

Posted by Austin Bennett <au...@apache.org>.
https://beamsummit.org/sessions/event-driven-movie-magic/

^^ the question made me think of that use case.  Though, unclear how close
it is to what you're thinking about.

Cheers -

On Fri, Dec 15, 2023 at 7:01 AM Byron Ellis via user <us...@beam.apache.org>
wrote:

> As Jan says, theoretically possible? Sure. That particular set of
> operations? Overkill. If you don't have it already set up I'd say even
> something like Airflow is overkill here. If all you need to do is "launch
> job and wait" when a file arrives... that's a small script and not
> something that particularly requires a distributed data processing system.
>
> On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi,
>>
>> Apache Beam describes itself as "Apache Beam is an open-source, unified
>> programming model for batch and streaming data processing pipelines, ...".
>> As such, it is possible to use it to express essentially arbitrary logic
>> and run it as a streaming pipeline. A streaming pipeline processes input
>> data and produces output data and/or actions. Given these assumptions, it
>> is technically feasible to use Apache Beam for orchestrating other
>> workflows, the problem is that it will very much likely not be efficient.
>> Apache Beam has a lot of heavy-lifting related to the fact it is designed
>> to process large volumes of data in a scalable way, which is probably not
>> what would one need for workflow orchestration. So, my two cents would be,
>> that although it _could_ be done, it probably _should not_ be done.
>>
>> Best,
>>
>>  Jan
>> On 12/15/23 13:39, Mikhail Khludnev wrote:
>>
>> Hello,
>> I think this page https://beam.apache.org/documentation/ml/orchestration/
>> might answer your question.
>> Frankly speaking: GCP Workflows and Apache Airflow.
>> But Beam itself is a data-stream/flow or batch processor; not a workflow
>> engine (IMHO).
>>
>> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <da...@gmail.com>
>> wrote:
>>
>>> I know it is technically possible, but my case may be a little special.
>>> Say I have 3 steps for my control flow (ETL workflow):
>>> Step 1. upstream file watching
>>> Step 2. call some external service to run one job, e.g. run a notebook,
>>> run a python script
>>> Step 3. notify downstream workflow
>>> Can I use apache beam to build a DAG with 3 nodes and run this as either
>>> flink or spark job.  It might be a little weird, but I just want to
>>> learn from the community whether this is the right way to use apache beam,
>>> and has anyone done this before? Thanks
>>>
>>>
>>>
>>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <
>>> user@beam.apache.org> wrote:
>>>
>>>> It’s technically possible but the closest thing I can think of would be
>>>> triggering things based on things like file watching.
>>>>
>>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Not using beam as time-based scheduler, but just use it to control
>>>>> execution orders of ETL workflow DAG, because beam's abstraction is also a
>>>>> DAG.
>>>>> I know it is a little weird, just want to confirm with the community,
>>>>> has anyone used beam like this before?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> can you give an example of what you mean for better understanding? Do
>>>>>> you mean using Beam as a scheduler of other ETL workflows?
>>>>>>
>>>>>>   Jan
>>>>>>
>>>>>> On 12/14/23 13:17, data_nerd_666 wrote:
>>>>>> > Hi all,
>>>>>> >
>>>>>> > I am new to apache beam, and am very excited to find beam in apache
>>>>>> > community. I see lots of use cases of using apache beam for data
>>>>>> flow
>>>>>> > (process large amount of batch/streaming data). I am just wondering
>>>>>> > whether I can use apache beam for control flow (ETL workflow). I
>>>>>> don't
>>>>>> > mean the spark/flink job in the ETL workflow, I mean the ETL
>>>>>> workflow
>>>>>> > itself. Because ETL workflow is also a DAG which is very similar as
>>>>>> > the abstraction of apache beam, but unfortunately I didn't find
>>>>>> such
>>>>>> > use cases on internet. So I'd like to ask this question in beam
>>>>>> > community to confirm whether I can use apache beam for control flow
>>>>>> > (ETL workflow). If yes, please let me know some success stories of
>>>>>> > this. Thanks
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>
>>>>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>>

Re: Can apache beam be used for control flow (ETL workflow)

Posted by Byron Ellis via user <us...@beam.apache.org>.
As Jan says, theoretically possible? Sure. That particular set of
operations? Overkill. If you don't have it already set up I'd say even
something like Airflow is overkill here. If all you need to do is "launch
job and wait" when a file arrives... that's a small script and not
something that particularly requires a distributed data processing system.
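
For illustration, a minimal sketch of the kind of small script described
above, using only the Python standard library. The names wait_for_file,
run_workflow and the notify callback are hypothetical, and polling the
filesystem is an assumption about how "a file arrives" is detected; a real
setup might use object-store notifications instead.

```python
import subprocess
import time
from pathlib import Path


def wait_for_file(path, poll_seconds=1.0, timeout_seconds=60.0):
    """Poll until `path` exists. Return True if found, False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if Path(path).exists():
            return True
        time.sleep(poll_seconds)
    # One last check so a zero timeout still detects an existing file.
    return Path(path).exists()


def run_workflow(trigger_path, job_command, notify, timeout_seconds=60.0):
    """Watch for a file, launch a job and wait for it, then notify downstream."""
    # Step 1: upstream file watching.
    if not wait_for_file(trigger_path, timeout_seconds=timeout_seconds):
        return "timed out"
    # Step 2: launch the external job and block until it exits.
    result = subprocess.run(job_command, check=False)
    if result.returncode != 0:
        return "job failed"
    # Step 3: notify the downstream workflow (hypothetical callback).
    notify(trigger_path)
    return "done"
```

A call might look like run_workflow("/data/in/_SUCCESS", ["python",
"job.py"], print), where the path and the job script are placeholders. Note
that this sketch has no retries or persistence, which is what a workflow
engine would add.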

On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský <je...@seznam.cz> wrote:

> Hi,
>
> Apache Beam describes itself as "Apache Beam is an open-source, unified
> programming model for batch and streaming data processing pipelines, ...".
> As such, it is possible to use it to express essentially arbitrary logic
> and run it as a streaming pipeline. A streaming pipeline processes input
> data and produces output data and/or actions. Given these assumptions, it
> is technically feasible to use Apache Beam for orchestrating other
> workflows, the problem is that it will very much likely not be efficient.
> Apache Beam has a lot of heavy-lifting related to the fact it is designed
> to process large volumes of data in a scalable way, which is probably not
> what would one need for workflow orchestration. So, my two cents would be,
> that although it _could_ be done, it probably _should not_ be done.
>
> Best,
>
>  Jan
> On 12/15/23 13:39, Mikhail Khludnev wrote:
>
> Hello,
> I think this page https://beam.apache.org/documentation/ml/orchestration/
> might answer your question.
> Frankly speaking: GCP Workflows and Apache Airflow.
> But Beam itself is a data-stream/flow or batch processor; not a workflow
> engine (IMHO).
>
> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <da...@gmail.com>
> wrote:
>
>> I know it is technically possible, but my case may be a little special.
>> Say I have 3 steps for my control flow (ETL workflow):
>> Step 1. upstream file watching
>> Step 2. call some external service to run one job, e.g. run a notebook,
>> run a python script
>> Step 3. notify downstream workflow
>> Can I use apache beam to build a DAG with 3 nodes and run this as either
>> flink or spark job.  It might be a little weird, but I just want to
>> learn from the community whether this is the right way to use apache beam,
>> and has anyone done this before? Thanks
>>
>>
>>
>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <
>> user@beam.apache.org> wrote:
>>
>>> It’s technically possible but the closest thing I can think of would be
>>> triggering things based on things like file watching.
>>>
>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 <da...@gmail.com>
>>> wrote:
>>>
>>>> Not using beam as time-based scheduler, but just use it to control
>>>> execution orders of ETL workflow DAG, because beam's abstraction is also a
>>>> DAG.
>>>> I know it is a little weird, just want to confirm with the community,
>>>> has anyone used beam like this before?
>>>>
>>>>
>>>>
>>>> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> can you give an example of what you mean for better understanding? Do
>>>>> you mean using Beam as a scheduler of other ETL workflows?
>>>>>
>>>>>   Jan
>>>>>
>>>>> On 12/14/23 13:17, data_nerd_666 wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > I am new to apache beam, and am very excited to find beam in apache
>>>>> > community. I see lots of use cases of using apache beam for data
>>>>> flow
>>>>> > (process large amount of batch/streaming data). I am just wondering
>>>>> > whether I can use apache beam for control flow (ETL workflow). I
>>>>> don't
>>>>> > mean the spark/flink job in the ETL workflow, I mean the ETL
>>>>> workflow
>>>>> > itself. Because ETL workflow is also a DAG which is very similar as
>>>>> > the abstraction of apache beam, but unfortunately I didn't find such
>>>>> > use cases on internet. So I'd like to ask this question in beam
>>>>> > community to confirm whether I can use apache beam for control flow
>>>>> > (ETL workflow). If yes, please let me know some success stories of
>>>>> > this. Thanks
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
>

Re: Can apache beam be used for control flow (ETL workflow)

Posted by Jan Lukavský <je...@seznam.cz>.
Hi,

Apache Beam describes itself as "an open-source, unified
programming model for batch and streaming data processing pipelines,
...". As such, it can express essentially arbitrary
logic and run it as a streaming pipeline. A streaming pipeline processes
input data and produces output data and/or actions. Given these
assumptions, it is technically feasible to use Apache Beam for
orchestrating other workflows; the problem is that it will very
likely not be efficient. Apache Beam does a lot of heavy lifting because
it is designed to process large volumes of data in a
scalable way, which is probably not what one would need for workflow
orchestration. So, my two cents: although it _could_ be
done, it probably _should not_ be done.

Best,

  Jan
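
To make the "control flow as a DAG" idea from this thread concrete, here is
a minimal sketch that orders the three steps discussed (watch file, run job,
notify) without any data processing engine. It assumes Python 3.9+ for the
standard graphlib module; run_dag and the step names are hypothetical, and
each step only appends to a log instead of doing real work.

```python
from graphlib import TopologicalSorter


def run_dag(steps, dependencies):
    """Run each step once, after all of the steps it depends on.

    steps: dict mapping step name -> zero-argument callable.
    dependencies: dict mapping step name -> set of step names it waits for.
    Returns the order in which the steps ran.
    """
    order = []
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        steps[name]()
        order.append(name)
    return order


# Hypothetical three-node workflow; each step just records that it ran.
log = []
steps = {
    "watch_file": lambda: log.append("file arrived"),
    "run_job": lambda: log.append("job finished"),
    "notify": lambda: log.append("downstream notified"),
}
dependencies = {
    "watch_file": set(),
    "run_job": {"watch_file"},
    "notify": {"run_job"},
}
order = run_dag(steps, dependencies)
```

This runs everything sequentially in one process. Scheduling, retries and
recovery after failure are the parts a dedicated workflow engine provides
and this sketch does not.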

On 12/15/23 13:39, Mikhail Khludnev wrote:
> Hello,
> I think this page 
> https://beam.apache.org/documentation/ml/orchestration/ might answer 
> your question.
> Frankly speaking: GCP Workflows and Apache Airflow.
> But Beam itself is a data-stream/flow or batch processor; not a 
> workflow engine (IMHO).

Re: Can apache beam be used for control flow (ETL workflow)

Posted by Mikhail Khludnev <mk...@apache.org>.
Hello,
I think this page https://beam.apache.org/documentation/ml/orchestration/
might answer your question.
Frankly speaking, the tools for this are GCP Workflows and Apache Airflow.
Beam itself is a stream or batch data processor, not a workflow engine
(IMHO).

On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <da...@gmail.com> wrote:

> I know it is technically possible, but my case may be a little special.
> Say I have 3 steps in my control flow (ETL workflow):
> Step 1. Watch for an upstream file
> Step 2. Call an external service to run one job, e.g. run a notebook or
> a Python script
> Step 3. Notify the downstream workflow
> Can I use Apache Beam to build a DAG with these 3 nodes and run it as
> either a Flink or a Spark job? It might be a little weird, but I just want
> to learn from the community whether this is the right way to use Apache
> Beam, and whether anyone has done this before. Thanks

-- 
Sincerely yours
Mikhail Khludnev

Re: Can apache beam be used for control flow (ETL workflow)

Posted by data_nerd_666 <da...@gmail.com>.
I know it is technically possible, but my case may be a little special. Say
I have 3 steps in my control flow (ETL workflow):
Step 1. Watch for an upstream file
Step 2. Call an external service to run one job, e.g. run a notebook or
a Python script
Step 3. Notify the downstream workflow
Can I use Apache Beam to build a DAG with these 3 nodes and run it as either
a Flink or a Spark job? It might be a little weird, but I just want to
learn from the community whether this is the right way to use Apache Beam,
and whether anyone has done this before. Thanks
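
To make the question concrete, here is a minimal sketch of the three steps
above as a control-flow DAG. It deliberately uses only Python's standard
library (graphlib), not Beam, so it shows what "controlling execution order"
means rather than how Beam would do it. The step names watch_upstream,
run_external_job and notify_downstream are hypothetical placeholders, and the
step bodies only record that they ran.

```python
from graphlib import TopologicalSorter


def watch_upstream(log):
    # Placeholder for Step 1: a real version would wait for an upstream file.
    log.append("watch_upstream")


def run_external_job(log):
    # Placeholder for Step 2: a real version would call an external service.
    log.append("run_external_job")


def notify_downstream(log):
    # Placeholder for Step 3: a real version would notify the next workflow.
    log.append("notify_downstream")


STEPS = {
    "watch_upstream": watch_upstream,
    "run_external_job": run_external_job,
    "notify_downstream": notify_downstream,
}

# Each key maps to the set of steps that must finish before it may start.
DEPENDENCIES = {
    "watch_upstream": set(),
    "run_external_job": {"watch_upstream"},
    "notify_downstream": {"run_external_job"},
}


def run_workflow():
    """Run every step once, in an order that respects DEPENDENCIES."""
    log = []
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        STEPS[name](log)
    return log


order = run_workflow()
print(order)
```

Note that each node here is a task that runs once, with no data passed along
the edges; that is the difference from a data-flow DAG that the question is
about.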



On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <us...@beam.apache.org>
wrote:

> It’s technically possible, but the closest thing I can think of would be
> triggering work based on events such as file watching.

Re: Can apache beam be used for control flow (ETL workflow)

Posted by Byron Ellis via user <us...@beam.apache.org>.
It’s technically possible, but the closest thing I can think of would be
triggering work based on events such as file watching.
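
As a rough illustration of what "triggering based on file watching" means,
here is a minimal polling sketch in plain Python. It is an assumption-laden
stand-in, not Beam code: Beam's own file-watching transforms (for example
fileio.MatchContinuously in the Python SDK) are not used here, and the names
new_files, poll_once and on_new_file are hypothetical.

```python
import os
import tempfile


def new_files(directory, seen):
    """Return names in `directory` not yet in `seen`, and remember them."""
    fresh = sorted(set(os.listdir(directory)) - seen)
    seen.update(fresh)
    return fresh


def poll_once(directory, seen, on_new_file):
    """Call `on_new_file(path)` for each file that appeared since last poll."""
    triggered = []
    for name in new_files(directory, seen):
        on_new_file(os.path.join(directory, name))
        triggered.append(name)
    return triggered


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as watched:
        seen = set()
        # Nothing has arrived yet, so nothing is triggered.
        print(poll_once(watched, seen, print))
        # Simulate an upstream file landing, then poll again.
        open(os.path.join(watched, "upstream.csv"), "w").close()
        print(poll_once(watched, seen, print))
```

A real watcher would call poll_once in a loop with a sleep between polls; each
new file becomes an element that downstream steps react to.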

On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 <da...@gmail.com> wrote:

> Not using Beam as a time-based scheduler, but just using it to control the
> execution order of an ETL workflow DAG, because Beam's abstraction is also
> a DAG.
> I know it is a little weird; I just want to confirm with the community:
> has anyone used Beam like this before?

Re: Can apache beam be used for control flow (ETL workflow)

Posted by data_nerd_666 <da...@gmail.com>.
Not using Beam as a time-based scheduler, but just using it to control the
execution order of an ETL workflow DAG, because Beam's abstraction is also
a DAG.
I know it is a little weird; I just want to confirm with the community:
has anyone used Beam like this before?



On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský <je...@seznam.cz> wrote:

> Hi,
>
> can you give an example of what you mean for better understanding? Do
> you mean using Beam as a scheduler of other ETL workflows?
>
>   Jan

Re: Can apache beam be used for control flow (ETL workflow)

Posted by Jan Lukavský <je...@seznam.cz>.
Hi,

can you give an example of what you mean for better understanding? Do 
you mean using Beam as a scheduler of other ETL workflows?

  Jan

On 12/14/23 13:17, data_nerd_666 wrote:
> Hi all,
>
> I am new to apache beam, and am very excited to find beam in apache 
> community. I see lots of use cases of using apache beam for data flow 
> (process large amount of batch/streaming data). I am just wondering 
> whether I can use apache beam for control flow (ETL workflow). I don't 
> mean the spark/flink job in the ETL workflow, I mean the ETL workflow 
> itself. Because ETL workflow is also a DAG which is very similar as 
> the abstraction of apache beam, but unfortunately I didn't find such 
> use cases on internet. So I'd like to ask this question in beam 
> community to confirm whether I can use apache beam for control flow 
> (ETL workflow). If yes, please let me know some success stories of 
> this. Thanks
>
>
>