Posted to user@beam.apache.org by Andrew Jones <an...@andrew-jones.com> on 2018/02/28 16:58:21 UTC

Running code before pipeline starts

Hi,

What is the best way to run code before the pipeline starts? Anything in the `main` function doesn't get called when the pipeline is run on Dataflow via a template - only the pipeline. If you're familiar with Spark, then I'm thinking of code that might be run in the driver.

Alternatively, is there a way I can run part of a pipeline first, then run another part once it's completed? Not sure that makes sense, so to illustrate with a poor attempt at an ASCII diagram, if I have something like this:

            events
            /    \
           |      \
           |    group by key
           |        |
           |    do some action
           |       /
           |      /
 once action is complete,
process all original elements

I can presumably achieve this by having `do some action` generate either an empty side input or an empty PCollection, which I can then use to create a PCollectionList along with the original and pass to Flatten.pCollections() before continuing. Not sure if that's the best way to do it though.
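For what it's worth, the Flatten idea described above might be sketched roughly like this in the Beam Java SDK. This is a sketch, not a tested recipe: all names (`FlattenGate`, `Events`, `EmptySignal`) are made up for illustration, and the empty branch stands in for the output of `do some action`.

```java
// A sketch of the Flatten/PCollectionList idea, assuming the Beam Java
// SDK is on the classpath. All names here are hypothetical.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class FlattenGate {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    PCollection<String> events = p.apply("Events", Create.of("a", "b"));

    // Stand-in for the "do some action" branch, which emits no elements
    // and exists only to gate the merge.
    PCollection<String> actionSignal =
        p.apply("EmptySignal", Create.empty(StringUtf8Coder.of()));

    // Merge the empty signal with the original elements before continuing.
    PCollection<String> merged =
        PCollectionList.of(events).and(actionSignal)
            .apply(Flatten.pCollections());

    p.run().waitUntilFinish();
  }
}
```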

Thanks,
Andrew

Re: Running code before pipeline starts

Posted by Andrew Jones <an...@andrew-jones.com>.
Thanks Lukasz, went with the side input approach and it worked
perfectly!

On Wed, 28 Feb 2018, at 18:28, Lukasz Cwik wrote:
> You should use a side input and not an empty PCollection that you
> flatten.


Re: Running code before pipeline starts

Posted by Lukasz Cwik <lc...@google.com>.
You should use a side input and not an empty PCollection that you flatten.

Since
ReadA --> Flatten --> ParDo
ReadB -/
can be equivalently executed as:
ReadA --> ParDo
ReadB --> ParDo

Make sure you access the side input in case a runner evaluates the side
input lazily.

So your pipeline would look like:
Create --> ParDo(DoAction) --> View.asSingleton() named X
... --> ParDo(ProcessElements).withSideInput(X) --> ...
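In the Beam Java SDK, that shape might look roughly like the sketch below. `DoAction` and `ProcessElements` are placeholder names, and the actual builder method is `withSideInputs`; treat this as a minimal sketch, not a definitive implementation.

```java
// Sketch of the side-input gating pattern, assuming the Beam Java SDK.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputGate {
  public static void main(String[] args) {
    Pipeline p =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // "Do some action" runs once; its result becomes a singleton side input.
    PCollectionView<String> actionDone =
        p.apply("Seed", Create.of("seed"))
         .apply("DoAction", ParDo.of(new DoFn<String, String>() {
           @ProcessElement
           public void process(ProcessContext c) {
             // ... perform the one-off action here ...
             c.output("done");
           }
         }))
         .apply("AsSingleton", View.asSingleton());

    PCollection<String> events = p.apply("Events", Create.of("a", "b", "c"));

    events.apply("ProcessElements", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // Actually read the side input, so a lazily-evaluating runner
        // must materialize it (i.e. run the action) first.
        String gate = c.sideInput(actionDone);
        c.output(gate + ":" + c.element());
      }
    }).withSideInputs(actionDone));

    p.run().waitUntilFinish();
  }
}
```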

An alternative would be to use CoGroupByKey to join the two streams since
it is not possible to split the execution like I showed with Flatten. It is
wasteful to add the CoGroupByKey but it is a lot less wasteful if you
convert a preceding GroupByKey in your pipeline into a CoGroupByKey joining
the two streams.
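The CoGroupByKey alternative could be sketched as below, again assuming the Beam Java SDK; the keys, tag names, and element values are invented for illustration.

```java
// Sketch of joining the events stream with the action-result stream via
// CoGroupByKey, assuming the Beam Java SDK. Names are hypothetical.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class CoGbkJoin {
  public static void main(String[] args) {
    Pipeline p =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    final TupleTag<String> eventsTag = new TupleTag<String>() {};
    final TupleTag<String> actionsTag = new TupleTag<String>() {};

    PCollection<KV<String, String>> events =
        p.apply("Events", Create.of(KV.of("k", "event1"), KV.of("k", "event2")));
    PCollection<KV<String, String>> actionResults =
        p.apply("ActionResults", Create.of(KV.of("k", "done")));

    KeyedPCollectionTuple.of(eventsTag, events)
        .and(actionsTag, actionResults)
        .apply("Join", CoGroupByKey.create())
        .apply("Process", ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
          @ProcessElement
          public void process(ProcessContext c) {
            // Elements for a key are only emitted once both streams have
            // been grouped, so the action has completed for that key.
            for (String e : c.element().getValue().getAll(eventsTag)) {
              c.output(e);
            }
          }
        }));

    p.run().waitUntilFinish();
  }
}
```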

On Wed, Feb 28, 2018 at 8:58 AM, Andrew Jones <an...@andrew-jones.com>
wrote:

> What is the best way to run code before the pipeline starts?